Merge pull request #31 from ZacharyHampton/v0.3

v0.3
pull/32/head v0.3.0
Zachary Hampton 2023-10-04 19:26:44 -07:00 committed by GitHub
commit 4a1116440d
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
17 changed files with 790 additions and 1280 deletions

206
README.md
View File

@ -1,6 +1,6 @@
<img src="https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/d1a2bf8b-09f5-4c57-b33a-0ada8a34f12d" width="400"> <img src="https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/d1a2bf8b-09f5-4c57-b33a-0ada8a34f12d" width="400">
**HomeHarvest** is a simple, yet comprehensive, real estate scraping library. **HomeHarvest** is a simple, yet comprehensive, real estate scraping library that extracts and formats data in the style of MLS listings.
[![Try with Replit](https://replit.com/badge?caption=Try%20with%20Replit)](https://replit.com/@ZacharyHampton/HomeHarvestDemo) [![Try with Replit](https://replit.com/badge?caption=Try%20with%20Replit)](https://replit.com/@ZacharyHampton/HomeHarvestDemo)
@ -11,10 +11,14 @@
Check out another project we wrote: ***[JobSpy](https://github.com/cullenwatson/JobSpy)** a Python package for job scraping* Check out another project we wrote: ***[JobSpy](https://github.com/cullenwatson/JobSpy)** a Python package for job scraping*
## Features ## HomeHarvest Features
- Scrapes properties from **Zillow**, **Realtor.com** & **Redfin** simultaneously - **Source**: Fetches properties directly from **Realtor.com**.
- Aggregates the properties in a Pandas DataFrame - **Data Format**: Structures data to resemble MLS listings.
- **Export Flexibility**: Options to save as either CSV or Excel.
- **Usage Modes**:
- **CLI**: For users who prefer command-line operations.
- **Python**: For those who'd like to integrate scraping into their Python scripts.
[Video Guide for HomeHarvest](https://youtu.be/JnV7eR2Ve2o) - _updated for release v0.2.7_ [Video Guide for HomeHarvest](https://youtu.be/JnV7eR2Ve2o) - _updated for release v0.2.7_
@ -31,136 +35,150 @@ pip install homeharvest
### CLI ### CLI
```
usage: homeharvest [-l {for_sale,for_rent,sold}] [-o {excel,csv}] [-f FILENAME] [-p PROXY] [-d DAYS] [-r RADIUS] [-m] location
Home Harvest Property Scraper
positional arguments:
location Location to scrape (e.g., San Francisco, CA)
options:
-l {for_sale,for_rent,sold}, --listing_type {for_sale,for_rent,sold}
Listing type to scrape
-o {excel,csv}, --output {excel,csv}
Output format
-f FILENAME, --filename FILENAME
Name of the output file (without extension)
-p PROXY, --proxy PROXY
Proxy to use for scraping
-d DAYS, --days DAYS Sold/listed in last _ days filter.
-r RADIUS, --radius RADIUS
Get comparable properties within _ (eg. 0.0) miles. Only applicable for individual addresses.
-m, --mls_only If set, fetches only MLS listings.
```
```bash ```bash
homeharvest "San Francisco, CA" -s zillow realtor.com redfin -l for_rent -o excel -f HomeHarvest > homeharvest "San Francisco, CA" -l for_rent -o excel -f HomeHarvest
``` ```
This will scrape properties from the specified sites for the given location and listing type, and save the results to an Excel file named `HomeHarvest.xlsx`. ### Python
By default:
- If `-s` or `--site_name` is not provided, it will scrape from all available sites.
- If `-l` or `--listing_type` is left blank, the default is `for_sale`. Other options are `for_rent` or `sold`.
- The `-o` or `--output` default format is `excel`. Options are `csv` or `excel`.
- If `-f` or `--filename` is left blank, the default is `HomeHarvest_<current_timestamp>`.
- If `-p` or `--proxy` is not provided, the scraper uses the local IP.
- Use `-k` or `--keep_duplicates` to keep duplicate properties based on address. If not provided, duplicates will be removed.
### Python
```py ```py
from homeharvest import scrape_property from homeharvest import scrape_property
import pandas as pd from datetime import datetime
properties: pd.DataFrame = scrape_property( # Generate filename based on current timestamp
site_name=["zillow", "realtor.com", "redfin"], current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
location="85281", filename = f"output/{current_timestamp}.csv"
listing_type="for_rent" # for_sale / sold
properties = scrape_property(
location="San Diego, CA",
listing_type="sold", # or (for_sale, for_rent)
property_younger_than=30, # sold in last 30 days - listed in last x days if (for_sale, for_rent)
mls_only=True, # only fetch MLS listings
) )
print(f"Number of properties: {len(properties)}")
#: Note, to export to CSV or Excel, use properties.to_csv() or properties.to_excel(). # Export to csv
print(properties) properties.to_csv(filename, index=False)
print(properties.head())
``` ```
## Output ## Output
```py
>>> properties.head()
property_url site_name listing_type apt_min_price apt_max_price ...
0 https://www.redfin.com/AZ/Tempe/1003-W-Washing... redfin for_rent 1666.0 2750.0 ...
1 https://www.redfin.com/AZ/Tempe/VELA-at-Town-L... redfin for_rent 1665.0 3763.0 ...
2 https://www.redfin.com/AZ/Tempe/Camden-Tempe/a... redfin for_rent 1939.0 3109.0 ...
3 https://www.redfin.com/AZ/Tempe/Emerson-Park/a... redfin for_rent 1185.0 1817.0 ...
4 https://www.redfin.com/AZ/Tempe/Rio-Paradiso-A... redfin for_rent 1470.0 2235.0 ...
[5 rows x 41 columns]
```
### Parameters for `scrape_properties()`
```plaintext ```plaintext
Required >>> properties.head()
├── location (str): address in various formats e.g. just zip, full address, city/state, etc. MLS MLS # Status Style ... COEDate LotSFApx PrcSqft Stories
└── listing_type (enum): for_rent, for_sale, sold 0 SDCA 230018348 SOLD CONDOS ... 2023-10-03 290110 803 2
Optional 1 SDCA 230016614 SOLD TOWNHOMES ... 2023-10-03 None 838 3
├── site_name (list[enum], default=all three sites): zillow, realtor.com, redfin 2 SDCA 230016367 SOLD CONDOS ... 2023-10-03 30056 649 1
├── proxy (str): in format 'http://user:pass@host:port' or [https, socks] 3 MRCA NDP2306335 SOLD SINGLE_FAMILY ... 2023-10-03 7519 661 2
└── keep_duplicates (bool, default=False): whether to keep or remove duplicate properties based on address 4 SDCA 230014532 SOLD CONDOS ... 2023-10-03 None 752 1
[5 rows x 22 columns]
``` ```
### Parameters for `scrape_property()`
```
Required
├── location (str): The address in various formats - this could be just a zip code, a full address, or city/state, etc.
└── listing_type (option): Choose the type of listing.
- 'for_rent'
- 'for_sale'
- 'sold'
Optional
├── radius (decimal): Radius in miles to find comparable properties based on individual addresses.
│ Example: 5.5 (fetches properties within a 5.5-mile radius if location is set to a specific address; otherwise, ignored)
├── property_younger_than (integer): Number of past days to filter properties. Utilizes 'last_sold_date' for 'sold' listing types, and 'list_date' for others (for_rent, for_sale).
│ Example: 30 (fetches properties listed/sold in the last 30 days)
├── mls_only (True/False): If set, fetches only MLS listings (mainly applicable to 'sold' listings)
└── proxy (string): In format 'http://user:pass@host:port'
```
### Property Schema ### Property Schema
```plaintext ```plaintext
Property Property
├── Basic Information: ├── Basic Information:
│ ├── property_url (str) │ ├── property_url
│ ├── site_name (enum): zillow, redfin, realtor.com ├── mls
│ ├── listing_type (enum): for_sale, for_rent, sold ├── mls_id
│ └── property_type (enum): house, apartment, condo, townhouse, single_family, multi_family, building └── status
├── Address Details: ├── Address Details:
│ ├── street_address (str) │ ├── street
│ ├── city (str) │ ├── unit
│ ├── state (str) │ ├── city
│ ├── zip_code (str) │ ├── state
│ ├── unit (str) │ └── zip_code
│ └── country (str)
├── House for Sale Features: ├── Property Description:
│ ├── tax_assessed_value (int) │ ├── style
│ ├── lot_area_value (float) │ ├── beds
│ ├── lot_area_unit (str) │ ├── full_baths
│ ├── stories (int) │ ├── half_baths
│ ├── year_built (int) │ ├── sqft
│ └── price_per_sqft (int) │ ├── year_built
│ ├── stories
│ └── lot_sqft
├── Building for Sale and Apartment Details: ├── Property Listing Details:
│ ├── bldg_name (str) │ ├── list_price
│ ├── beds_min (int) │ ├── list_date
│ ├── beds_max (int) │ ├── sold_price
│ ├── baths_min (float) │ ├── last_sold_date
│ ├── baths_max (float) │ ├── price_per_sqft
│ ├── sqft_min (int) │ └── hoa_fee
│ ├── sqft_max (int)
│ ├── price_min (int)
│ ├── price_max (int)
│ ├── area_min (int)
│ └── unit_count (int)
├── Miscellaneous Details: ├── Location Details:
│ ├── mls_id (str) │ ├── latitude
│ ├── agent_name (str) │ ├── longitude
│ ├── img_src (str)
│ ├── description (str)
│ ├── status_text (str)
│ └── posted_time (str)
└── Location Details: └── Parking Details:
├── latitude (float) └── parking_garage
└── longitude (float)
``` ```
## Supported Countries for Property Scraping
* **Zillow**: contains listings in the **US** & **Canada**
* **Realtor.com**: mainly from the **US** but also has international listings
* **Redfin**: listings mainly in the **US**, **Canada**, & has expanded to some areas in **Mexico**
### Exceptions ### Exceptions
The following exceptions may be raised when using HomeHarvest: The following exceptions may be raised when using HomeHarvest:
- `InvalidSite` - valid options: `zillow`, `redfin`, `realtor.com`
- `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold` - `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold`
- `NoResultsFound` - no properties found from your input - `NoResultsFound` - no properties found from your search
- `GeoCoordsNotFound` - if Zillow scraper is not able to derive geo-coordinates from the location you input
## Frequently Asked Questions ## Frequently Asked Questions
--- ---
**Q: Encountering issues with your queries?** **Q: Encountering issues with your searches?**
**A:** Try a single site and/or broaden the location. If problems persist, [submit an issue](https://github.com/ZacharyHampton/HomeHarvest/issues). **A:** Try to broaden the parameters you're using. If problems persist, [submit an issue](https://github.com/ZacharyHampton/HomeHarvest/issues).
--- ---
**Q: Received a Forbidden 403 response code?** **Q: Received a Forbidden 403 response code?**
**A:** This indicates that you have been blocked by the real estate site for sending too many requests. Currently, **Zillow** is particularly aggressive with blocking. We recommend: **A:** This indicates that you have been blocked by Realtor.com for sending too many requests. We recommend:
- Waiting a few seconds between requests. - Waiting a few seconds between requests.
- Trying a VPN to change your IP address. - Trying a VPN or useing a proxy as a parameter to scrape_property() to change your IP address.
--- ---

View File

@ -31,7 +31,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# scrapes all 3 sites by default\n", "# check for sale properties\n",
"scrape_property(\n", "scrape_property(\n",
" location=\"dallas\",\n", " location=\"dallas\",\n",
" listing_type=\"for_sale\"\n", " listing_type=\"for_sale\"\n",
@ -53,7 +53,6 @@
"# search a specific address\n", "# search a specific address\n",
"scrape_property(\n", "scrape_property(\n",
" location=\"2530 Al Lipscomb Way\",\n", " location=\"2530 Al Lipscomb Way\",\n",
" site_name=\"zillow\",\n",
" listing_type=\"for_sale\"\n", " listing_type=\"for_sale\"\n",
")" ")"
] ]
@ -68,7 +67,6 @@
"# check rentals\n", "# check rentals\n",
"scrape_property(\n", "scrape_property(\n",
" location=\"chicago, illinois\",\n", " location=\"chicago, illinois\",\n",
" site_name=[\"redfin\", \"zillow\"],\n",
" listing_type=\"for_rent\"\n", " listing_type=\"for_rent\"\n",
")" ")"
] ]
@ -88,7 +86,6 @@
"# check sold properties\n", "# check sold properties\n",
"scrape_property(\n", "scrape_property(\n",
" location=\"90210\",\n", " location=\"90210\",\n",
" site_name=[\"redfin\"],\n",
" listing_type=\"sold\"\n", " listing_type=\"sold\"\n",
")" ")"
] ]

View File

@ -0,0 +1,18 @@
from homeharvest import scrape_property
from datetime import datetime
# Generate filename based on current timestamp
current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"output/{current_timestamp}.csv"
properties = scrape_property(
location="San Diego, CA",
listing_type="sold", # for_sale, for_rent
property_younger_than=30, # sold/listed in last 30 days
mls_only=True, # only fetch MLS listings
)
print(f"Number of properties: {len(properties)}")
# Export to csv
properties.to_csv(filename, index=False)
print(properties.head())

View File

@ -1,187 +1,50 @@
import warnings
import pandas as pd import pandas as pd
from typing import Union
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from .core.scrapers import ScraperInput from .core.scrapers import ScraperInput
from .core.scrapers.redfin import RedfinScraper from .utils import process_result, ordered_properties, validate_input
from .core.scrapers.realtor import RealtorScraper from .core.scrapers.realtor import RealtorScraper
from .core.scrapers.zillow import ZillowScraper from .core.scrapers.models import ListingType
from .core.scrapers.models import ListingType, Property, SiteName from .exceptions import InvalidListingType, NoResultsFound
from .exceptions import InvalidSite, InvalidListingType
_scrapers = {
"redfin": RedfinScraper,
"realtor.com": RealtorScraper,
"zillow": ZillowScraper,
}
def _validate_input(site_name: str, listing_type: str) -> None:
if site_name.lower() not in _scrapers:
raise InvalidSite(f"Provided site, '{site_name}', does not exist.")
if listing_type.upper() not in ListingType.__members__:
raise InvalidListingType(f"Provided listing type, '{listing_type}', does not exist.")
def _get_ordered_properties(result: Property) -> list[str]:
return [
"property_url",
"site_name",
"listing_type",
"property_type",
"status_text",
"baths_min",
"baths_max",
"beds_min",
"beds_max",
"sqft_min",
"sqft_max",
"price_min",
"price_max",
"unit_count",
"tax_assessed_value",
"price_per_sqft",
"lot_area_value",
"lot_area_unit",
"address_one",
"address_two",
"city",
"state",
"zip_code",
"posted_time",
"area_min",
"bldg_name",
"stories",
"year_built",
"agent_name",
"agent_phone",
"agent_email",
"days_on_market",
"sold_date",
"mls_id",
"img_src",
"latitude",
"longitude",
"description",
]
def _process_result(result: Property) -> pd.DataFrame:
prop_data = result.__dict__
prop_data["site_name"] = prop_data["site_name"].value
prop_data["listing_type"] = prop_data["listing_type"].value.lower()
if "property_type" in prop_data and prop_data["property_type"] is not None:
prop_data["property_type"] = prop_data["property_type"].value.lower()
else:
prop_data["property_type"] = None
if "address" in prop_data:
address_data = prop_data["address"]
prop_data["address_one"] = address_data.address_one
prop_data["address_two"] = address_data.address_two
prop_data["city"] = address_data.city
prop_data["state"] = address_data.state
prop_data["zip_code"] = address_data.zip_code
del prop_data["address"]
if "agent" in prop_data and prop_data["agent"] is not None:
agent_data = prop_data["agent"]
prop_data["agent_name"] = agent_data.name
prop_data["agent_phone"] = agent_data.phone
prop_data["agent_email"] = agent_data.email
del prop_data["agent"]
else:
prop_data["agent_name"] = None
prop_data["agent_phone"] = None
prop_data["agent_email"] = None
properties_df = pd.DataFrame([prop_data])
properties_df = properties_df[_get_ordered_properties(result)]
return properties_df
def _scrape_single_site(location: str, site_name: str, listing_type: str, proxy: str = None) -> pd.DataFrame:
"""
Helper function to scrape a single site.
"""
_validate_input(site_name, listing_type)
scraper_input = ScraperInput(
location=location,
listing_type=ListingType[listing_type.upper()],
site_name=SiteName.get_by_value(site_name.lower()),
proxy=proxy,
)
site = _scrapers[site_name.lower()](scraper_input)
results = site.search()
properties_dfs = [_process_result(result) for result in results]
properties_dfs = [df.dropna(axis=1, how="all") for df in properties_dfs if not df.empty]
if not properties_dfs:
return pd.DataFrame()
return pd.concat(properties_dfs, ignore_index=True)
def scrape_property( def scrape_property(
location: str, location: str,
site_name: Union[str, list[str]] = None,
listing_type: str = "for_sale", listing_type: str = "for_sale",
radius: float = None,
mls_only: bool = False,
property_younger_than: int = None,
pending_or_contingent: bool = False,
proxy: str = None, proxy: str = None,
keep_duplicates: bool = False
) -> pd.DataFrame: ) -> pd.DataFrame:
""" """
Scrape property from various sites from a given location and listing type. Scrape properties from Realtor.com based on a given location and listing type.
:param location: Location to search (e.g. "Dallas, TX", "85281", "2530 Al Lipscomb Way")
:returns: pd.DataFrame :param listing_type: Listing Type (for_sale, for_rent, sold)
:param location: US Location (e.g. 'San Francisco, CA', 'Cook County, IL', '85281', '2530 Al Lipscomb Way') :param radius: Get properties within _ (e.g. 1.0) miles. Only applicable for individual addresses.
:param site_name: Site name or list of site names (e.g. ['realtor.com', 'zillow'], 'redfin') :param mls_only: If set, fetches only listings with MLS IDs.
:param listing_type: Listing type (e.g. 'for_sale', 'for_rent', 'sold') :param property_younger_than: Get properties sold/listed in last _ days.
:return: pd.DataFrame containing properties :param pending_or_contingent: If set, fetches only pending or contingent listings. Only applicable for for_sale listings from general area searches.
:param proxy: Proxy to use for scraping
""" """
if site_name is None: validate_input(listing_type)
site_name = list(_scrapers.keys())
if not isinstance(site_name, list): scraper_input = ScraperInput(
site_name = [site_name] location=location,
listing_type=ListingType[listing_type.upper()],
proxy=proxy,
radius=radius,
mls_only=mls_only,
last_x_days=property_younger_than,
pending_or_contingent=pending_or_contingent,
)
results = [] site = RealtorScraper(scraper_input)
results = site.search()
if len(site_name) == 1: properties_dfs = [process_result(result) for result in results]
final_df = _scrape_single_site(location, site_name[0], listing_type, proxy) if not properties_dfs:
results.append(final_df) raise NoResultsFound("no results found for the query")
else:
with ThreadPoolExecutor() as executor:
futures = {
executor.submit(_scrape_single_site, location, s_name, listing_type, proxy): s_name
for s_name in site_name
}
for future in concurrent.futures.as_completed(futures): with warnings.catch_warnings():
result = future.result() warnings.simplefilter("ignore", category=FutureWarning)
results.append(result) return pd.concat(properties_dfs, ignore_index=True, axis=0)[ordered_properties]
results = [df for df in results if not df.empty and not df.isna().all().all()]
if not results:
return pd.DataFrame()
final_df = pd.concat(results, ignore_index=True)
columns_to_track = ["address_one", "address_two", "city"]
#: validate they exist, otherwise create them
for col in columns_to_track:
if col not in final_df.columns:
final_df[col] = None
if not keep_duplicates:
final_df = final_df.drop_duplicates(subset=columns_to_track, keep="first")
return final_df

View File

@ -5,15 +5,8 @@ from homeharvest import scrape_property
def main(): def main():
parser = argparse.ArgumentParser(description="Home Harvest Property Scraper") parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
parser.add_argument("location", type=str, help="Location to scrape (e.g., San Francisco, CA)")
parser.add_argument( parser.add_argument(
"-s", "location", type=str, help="Location to scrape (e.g., San Francisco, CA)"
"--site_name",
type=str,
nargs="*",
default=None,
help="Site name(s) to scrape from (e.g., realtor, zillow)",
) )
parser.add_argument( parser.add_argument(
@ -43,17 +36,40 @@ def main():
) )
parser.add_argument( parser.add_argument(
"-k", "-p", "--proxy", type=str, default=None, help="Proxy to use for scraping"
"--keep_duplicates", )
action="store_true", parser.add_argument(
help="Keep duplicate properties based on address" "-d",
"--days",
type=int,
default=None,
help="Sold/listed in last _ days filter.",
) )
parser.add_argument("-p", "--proxy", type=str, default=None, help="Proxy to use for scraping") parser.add_argument(
"-r",
"--radius",
type=float,
default=None,
help="Get comparable properties within _ (eg. 0.0) miles. Only applicable for individual addresses.",
)
parser.add_argument(
"-m",
"--mls_only",
action="store_true",
help="If set, fetches only MLS listings.",
)
args = parser.parse_args() args = parser.parse_args()
result = scrape_property(args.location, args.site_name, args.listing_type, proxy=args.proxy, keep_duplicates=args.keep_duplicates) result = scrape_property(
args.location,
args.listing_type,
radius=args.radius,
proxy=args.proxy,
mls_only=args.mls_only,
property_younger_than=args.days,
)
if not args.filename: if not args.filename:
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S") timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

View File

@ -8,12 +8,19 @@ from .models import Property, ListingType, SiteName
class ScraperInput: class ScraperInput:
location: str location: str
listing_type: ListingType listing_type: ListingType
site_name: SiteName radius: float | None = None
mls_only: bool | None = None
proxy: str | None = None proxy: str | None = None
last_x_days: int | None = None
pending_or_contingent: bool | None = None
class Scraper: class Scraper:
def __init__(self, scraper_input: ScraperInput, session: requests.Session | tls_client.Session = None): def __init__(
self,
scraper_input: ScraperInput,
session: requests.Session | tls_client.Session = None,
):
self.location = scraper_input.location self.location = scraper_input.location
self.listing_type = scraper_input.listing_type self.listing_type = scraper_input.listing_type
@ -28,7 +35,10 @@ class Scraper:
self.session.proxies.update(proxies) self.session.proxies.update(proxies)
self.listing_type = scraper_input.listing_type self.listing_type = scraper_input.listing_type
self.site_name = scraper_input.site_name self.radius = scraper_input.radius
self.last_x_days = scraper_input.last_x_days
self.mls_only = scraper_input.mls_only
self.pending_or_contingent = scraper_input.pending_or_contingent
def search(self) -> list[Property]: def search(self) -> list[Property]:
... ...

View File

@ -1,7 +1,6 @@
from dataclasses import dataclass from dataclasses import dataclass
from enum import Enum from enum import Enum
from typing import Tuple from typing import Optional
from datetime import datetime
class SiteName(Enum): class SiteName(Enum):
@ -23,98 +22,44 @@ class ListingType(Enum):
SOLD = "SOLD" SOLD = "SOLD"
class PropertyType(Enum):
HOUSE = "HOUSE"
BUILDING = "BUILDING"
CONDO = "CONDO"
TOWNHOUSE = "TOWNHOUSE"
SINGLE_FAMILY = "SINGLE_FAMILY"
MULTI_FAMILY = "MULTI_FAMILY"
MANUFACTURED = "MANUFACTURED"
NEW_CONSTRUCTION = "NEW_CONSTRUCTION"
APARTMENT = "APARTMENT"
APARTMENTS = "APARTMENTS"
LAND = "LAND"
LOT = "LOT"
OTHER = "OTHER"
BLANK = "BLANK"
@classmethod
def from_int_code(cls, code):
mapping = {
1: cls.HOUSE,
2: cls.CONDO,
3: cls.TOWNHOUSE,
4: cls.MULTI_FAMILY,
5: cls.LAND,
6: cls.OTHER,
8: cls.SINGLE_FAMILY,
13: cls.SINGLE_FAMILY,
}
return mapping.get(code, cls.BLANK)
@dataclass @dataclass
class Address: class Address:
address_one: str | None = None street: str | None = None
address_two: str | None = "#" unit: str | None = None
city: str | None = None city: str | None = None
state: str | None = None state: str | None = None
zip_code: str | None = None zip: str | None = None
@dataclass @dataclass
class Agent: class Description:
name: str style: str | None = None
phone: str | None = None beds: int | None = None
email: str | None = None baths_full: int | None = None
baths_half: int | None = None
sqft: int | None = None
lot_sqft: int | None = None
sold_price: int | None = None
year_built: int | None = None
garage: float | None = None
stories: int | None = None
@dataclass @dataclass
class Property: class Property:
property_url: str property_url: str
site_name: SiteName mls: str | None = None
listing_type: ListingType
address: Address
property_type: PropertyType | None = None
# house for sale
tax_assessed_value: int | None = None
lot_area_value: float | None = None
lot_area_unit: str | None = None
stories: int | None = None
year_built: int | None = None
price_per_sqft: int | None = None
mls_id: str | None = None mls_id: str | None = None
status: str | None = None
address: Address | None = None
agent: Agent | None = None list_price: int | None = None
img_src: str | None = None list_date: str | None = None
description: str | None = None last_sold_date: str | None = None
status_text: str | None = None prc_sqft: int | None = None
posted_time: datetime | None = None hoa_fee: int | None = None
description: Description | None = None
# building for sale
bldg_name: str | None = None
area_min: int | None = None
beds_min: int | None = None
beds_max: int | None = None
baths_min: float | None = None
baths_max: float | None = None
sqft_min: int | None = None
sqft_max: int | None = None
price_min: int | None = None
price_max: int | None = None
unit_count: int | None = None
latitude: float | None = None latitude: float | None = None
longitude: float | None = None longitude: float | None = None
neighborhoods: Optional[str] = None
sold_date: datetime | None = None
days_on_market: int | None = None

View File

@ -2,39 +2,26 @@
homeharvest.realtor.__init__ homeharvest.realtor.__init__
~~~~~~~~~~~~ ~~~~~~~~~~~~
This module implements the scraper for relator.com This module implements the scraper for realtor.com
""" """
from ..models import Property, Address from typing import Dict, Union, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
from .. import Scraper from .. import Scraper
from ....exceptions import NoResultsFound from ....exceptions import NoResultsFound
from ....utils import parse_address_one, parse_address_two from ..models import Property, Address, ListingType, Description
from concurrent.futures import ThreadPoolExecutor, as_completed
class RealtorScraper(Scraper): class RealtorScraper(Scraper):
SEARCH_GQL_URL = "https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta"
PROPERTY_URL = "https://www.realtor.com/realestateandhomes-detail/"
ADDRESS_AUTOCOMPLETE_URL = "https://parser-external.geo.moveaws.com/suggest"
def __init__(self, scraper_input): def __init__(self, scraper_input):
self.counter = 1 self.counter = 1
super().__init__(scraper_input) super().__init__(scraper_input)
self.search_url = (
"https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta"
)
def handle_location(self): def handle_location(self):
headers = {
"authority": "parser-external.geo.moveaws.com",
"accept": "*/*",
"accept-language": "en-US,en;q=0.9",
"origin": "https://www.realtor.com",
"referer": "https://www.realtor.com/",
"sec-ch-ua": '"Chromium";v="116", "Not)A;Brand";v="24", "Google Chrome";v="116"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "cross-site",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
}
params = { params = {
"input": self.location, "input": self.location,
"client_id": self.listing_type.value.lower().replace("_", "-"), "client_id": self.listing_type.value.lower().replace("_", "-"),
@ -43,9 +30,8 @@ class RealtorScraper(Scraper):
} }
response = self.session.get( response = self.session.get(
"https://parser-external.geo.moveaws.com/suggest", self.ADDRESS_AUTOCOMPLETE_URL,
params=params, params=params,
headers=headers,
) )
response_json = response.json() response_json = response.json()
@ -56,6 +42,145 @@ class RealtorScraper(Scraper):
return result[0] return result[0]
def handle_listing(self, listing_id: str) -> list[Property]:
query = """query Listing($listing_id: ID!) {
listing(id: $listing_id) {
source {
id
listing_id
}
address {
street_number
street_name
street_suffix
unit
city
state_code
postal_code
location {
coordinate {
lat
lon
}
}
}
basic {
sqft
beds
baths_full
baths_half
lot_sqft
sold_price
sold_price
type
price
status
sold_date
list_date
}
details {
year_built
stories
garage
permalink
}
}
}"""
variables = {"listing_id": listing_id}
payload = {
"query": query,
"variables": variables,
}
response = self.session.post(self.SEARCH_GQL_URL, json=payload)
response_json = response.json()
property_info = response_json["data"]["listing"]
mls = (
property_info["source"].get("id")
if "source" in property_info and isinstance(property_info["source"], dict)
else None
)
able_to_get_lat_long = (
property_info
and property_info.get("address")
and property_info["address"].get("location")
and property_info["address"]["location"].get("coordinate")
)
listing = Property(
mls=mls,
mls_id=property_info["source"].get("listing_id")
if "source" in property_info and isinstance(property_info["source"], dict)
else None,
property_url=f"{self.PROPERTY_URL}{property_info['details']['permalink']}",
status=property_info["basic"]["status"].upper(),
list_price=property_info["basic"]["price"],
list_date=property_info["basic"]["list_date"].split("T")[0]
if property_info["basic"].get("list_date")
else None,
prc_sqft=property_info["basic"].get("price") / property_info["basic"].get("sqft")
if property_info["basic"].get("price") and property_info["basic"].get("sqft")
else None,
last_sold_date=property_info["basic"]["sold_date"].split("T")[0]
if property_info["basic"].get("sold_date")
else None,
latitude=property_info["address"]["location"]["coordinate"].get("lat")
if able_to_get_lat_long
else None,
longitude=property_info["address"]["location"]["coordinate"].get("lon")
if able_to_get_lat_long
else None,
address=self._parse_address(property_info, search_type="handle_listing"),
description=Description(
style=property_info["basic"].get("type", "").upper(),
beds=property_info["basic"].get("beds"),
baths_full=property_info["basic"].get("baths_full"),
baths_half=property_info["basic"].get("baths_half"),
sqft=property_info["basic"].get("sqft"),
lot_sqft=property_info["basic"].get("lot_sqft"),
sold_price=property_info["basic"].get("sold_price"),
year_built=property_info["details"].get("year_built"),
garage=property_info["details"].get("garage"),
stories=property_info["details"].get("stories"),
)
)
return [listing]
def get_latest_listing_id(self, property_id: str) -> str | None:
query = """query Property($property_id: ID!) {
property(id: $property_id) {
listings {
listing_id
primary
}
}
}
"""
variables = {"property_id": property_id}
payload = {
"query": query,
"variables": variables,
}
response = self.session.post(self.SEARCH_GQL_URL, json=payload)
response_json = response.json()
property_info = response_json["data"]["property"]
if property_info["listings"] is None:
return None
primary_listing = next((listing for listing in property_info["listings"] if listing["primary"]), None)
if primary_listing:
return primary_listing["listing_id"]
else:
return property_info["listings"][0]["listing_id"]
def handle_address(self, property_id: str) -> list[Property]: def handle_address(self, property_id: str) -> list[Property]:
""" """
Handles a specific address & returns one property Handles a specific address & returns one property
@ -71,22 +196,19 @@ class RealtorScraper(Scraper):
stories stories
} }
address { address {
address_validation_code
city
country
county
line
postal_code
state_code
street_direction
street_name
street_number street_number
street_name
street_suffix street_suffix
street_post_direction
unit_value
unit unit
unit_descriptor city
zip state_code
postal_code
location {
coordinate {
lat
lon
}
}
} }
basic { basic {
baths baths
@ -114,127 +236,175 @@ class RealtorScraper(Scraper):
"variables": variables, "variables": variables,
} }
response = self.session.post(self.search_url, json=payload) response = self.session.post(self.SEARCH_GQL_URL, json=payload)
response_json = response.json() response_json = response.json()
property_info = response_json["data"]["property"] property_info = response_json["data"]["property"]
address_one, address_two = parse_address_one(property_info["address"]["line"])
return [ return [
Property( Property(
site_name=self.site_name,
address=Address(
address_one=address_one,
address_two=address_two,
city=property_info["address"]["city"],
state=property_info["address"]["state_code"],
zip_code=property_info["address"]["postal_code"],
),
property_url="https://www.realtor.com/realestateandhomes-detail/"
+ property_info["details"]["permalink"],
stories=property_info["details"]["stories"],
year_built=property_info["details"]["year_built"],
price_per_sqft=property_info["basic"]["price"] // property_info["basic"]["sqft"]
if property_info["basic"]["sqft"] is not None and property_info["basic"]["price"] is not None
else None,
mls_id=property_id, mls_id=property_id,
listing_type=self.listing_type, property_url=f"{self.PROPERTY_URL}{property_info['details']['permalink']}",
lot_area_value=property_info["public_record"]["lot_size"] address=self._parse_address(
if property_info["public_record"] is not None property_info, search_type="handle_address"
else None, ),
beds_min=property_info["basic"]["beds"], description=self._parse_description(property_info),
beds_max=property_info["basic"]["beds"],
baths_min=property_info["basic"]["baths"],
baths_max=property_info["basic"]["baths"],
sqft_min=property_info["basic"]["sqft"],
sqft_max=property_info["basic"]["sqft"],
price_min=property_info["basic"]["price"],
price_max=property_info["basic"]["price"],
) )
] ]
def handle_area(self, variables: dict, return_total: bool = False) -> list[Property] | int: def general_search(
self, variables: dict, search_type: str
) -> Dict[str, Union[int, list[Property]]]:
""" """
Handles a location area & returns a list of properties Handles a location area & returns a list of properties
""" """
query = ( results_query = """{
"""query Home_search( count
$city: String, total
$county: [String], results {
$state_code: String, property_id
$postal_code: String list_date
$offset: Int, status
) { last_sold_price
home_search( last_sold_date
query: { list_price
city: $city price_per_sqft
county: $county description {
postal_code: $postal_code sqft
state_code: $state_code beds
status: %s baths_full
baths_half
lot_sqft
sold_price
year_built
garage
sold_price
type
name
stories
} }
limit: 200 source {
offset: $offset id
) { listing_id
count }
total hoa {
results { fee
property_id }
description { location {
baths address {
beds street_number
lot_sqft street_name
sqft street_suffix
text unit
sold_price city
stories state_code
year_built postal_code
garage coordinate {
unit_number lon
floor_number lat
}
location {
address {
city
country
line
postal_code
state_code
state
street_direction
street_name
street_number
street_post_direction
street_suffix
unit
coordinate {
lon
lat
}
} }
} }
list_price neighborhoods {
price_per_sqft name
source {
id
} }
} }
} }
}""" }
% self.listing_type.value.lower() }"""
date_param = (
'sold_date: { min: "$today-%sD" }' % self.last_x_days
if self.listing_type == ListingType.SOLD and self.last_x_days
else (
'list_date: { min: "$today-%sD" }' % self.last_x_days
if self.last_x_days
else ""
)
) )
sort_param = (
"sort: [{ field: sold_date, direction: desc }]"
if self.listing_type == ListingType.SOLD
else "sort: [{ field: list_date, direction: desc }]"
)
pending_or_contingent_param = "or_filters: { contingent: true, pending: true }" if self.pending_or_contingent else ""
if search_type == "comps": #: comps search, came from an address
query = """query Property_search(
$coordinates: [Float]!
$radius: String!
$offset: Int!,
) {
property_search(
query: {
nearby: {
coordinates: $coordinates
radius: $radius
}
status: %s
%s
}
%s
limit: 200
offset: $offset
) %s""" % (
self.listing_type.value.lower(),
date_param,
sort_param,
results_query,
)
elif search_type == "area": #: general search, came from a general location
query = """query Home_search(
$city: String,
$county: [String],
$state_code: String,
$postal_code: String
$offset: Int,
) {
home_search(
query: {
city: $city
county: $county
postal_code: $postal_code
state_code: $state_code
status: %s
%s
%s
}
%s
limit: 200
offset: $offset
) %s""" % (
self.listing_type.value.lower(),
date_param,
pending_or_contingent_param,
sort_param,
results_query,
)
else: #: general search, came from an address
query = (
"""query Property_search(
$property_id: [ID]!
$offset: Int!,
) {
property_search(
query: {
property_id: $property_id
}
limit: 1
offset: $offset
) %s""" % results_query)
payload = { payload = {
"query": query, "query": query,
"variables": variables, "variables": variables,
} }
response = self.session.post(self.search_url, json=payload) response = self.session.post(self.SEARCH_GQL_URL, json=payload)
response.raise_for_status() response.raise_for_status()
response_json = response.json() response_json = response.json()
search_key = "home_search" if search_type == "area" else "property_search"
if return_total:
return response_json["data"]["home_search"]["total"]
properties: list[Property] = [] properties: list[Property] = []
@ -242,89 +412,164 @@ class RealtorScraper(Scraper):
response_json is None response_json is None
or "data" not in response_json or "data" not in response_json
or response_json["data"] is None or response_json["data"] is None
or "home_search" not in response_json["data"] or search_key not in response_json["data"]
or response_json["data"]["home_search"] is None or response_json["data"][search_key] is None
or "results" not in response_json["data"]["home_search"] or "results" not in response_json["data"][search_key]
): ):
return [] return {"total": 0, "properties": []}
for result in response_json["data"]["home_search"]["results"]: for result in response_json["data"][search_key]["results"]:
self.counter += 1 self.counter += 1
address_one, _ = parse_address_one(result["location"]["address"]["line"]) mls = (
result["source"].get("id")
if "source" in result and isinstance(result["source"], dict)
else None
)
if not mls and self.mls_only:
continue
able_to_get_lat_long = (
result
and result.get("location")
and result["location"].get("address")
and result["location"]["address"].get("coordinate")
)
realty_property = Property( realty_property = Property(
address=Address( mls=mls,
address_one=address_one, mls_id=result["source"].get("listing_id")
city=result["location"]["address"]["city"], if "source" in result and isinstance(result["source"], dict)
state=result["location"]["address"]["state_code"],
zip_code=result["location"]["address"]["postal_code"],
address_two=parse_address_two(result["location"]["address"]["unit"]),
),
latitude=result["location"]["address"]["coordinate"]["lat"]
if result
and result.get("location")
and result["location"].get("address")
and result["location"]["address"].get("coordinate")
and "lat" in result["location"]["address"]["coordinate"]
else None, else None,
longitude=result["location"]["address"]["coordinate"]["lon"] property_url=f"{self.PROPERTY_URL}{result['property_id']}",
if result status=result["status"].upper(),
and result.get("location") list_price=result["list_price"],
and result["location"].get("address") list_date=result["list_date"].split("T")[0]
and result["location"]["address"].get("coordinate") if result.get("list_date")
and "lon" in result["location"]["address"]["coordinate"]
else None, else None,
site_name=self.site_name, prc_sqft=result.get("price_per_sqft"),
property_url="https://www.realtor.com/realestateandhomes-detail/" + result["property_id"], last_sold_date=result.get("last_sold_date"),
stories=result["description"]["stories"], hoa_fee=result["hoa"]["fee"]
year_built=result["description"]["year_built"], if result.get("hoa") and isinstance(result["hoa"], dict)
price_per_sqft=result["price_per_sqft"], else None,
mls_id=result["property_id"], latitude=result["location"]["address"]["coordinate"].get("lat")
listing_type=self.listing_type, if able_to_get_lat_long
lot_area_value=result["description"]["lot_sqft"], else None,
beds_min=result["description"]["beds"], longitude=result["location"]["address"]["coordinate"].get("lon")
beds_max=result["description"]["beds"], if able_to_get_lat_long
baths_min=result["description"]["baths"], else None,
baths_max=result["description"]["baths"], address=self._parse_address(result, search_type="general_search"),
sqft_min=result["description"]["sqft"], #: neighborhoods=self._parse_neighborhoods(result),
sqft_max=result["description"]["sqft"], description=self._parse_description(result),
price_min=result["list_price"],
price_max=result["list_price"],
) )
properties.append(realty_property) properties.append(realty_property)
return properties return {
"total": response_json["data"][search_key]["total"],
"properties": properties,
}
def search(self): def search(self):
location_info = self.handle_location() location_info = self.handle_location()
location_type = location_info["area_type"] location_type = location_info["area_type"]
if location_type == "address":
property_id = location_info["mpr_id"]
return self.handle_address(property_id)
offset = 0
search_variables = { search_variables = {
"city": location_info.get("city"), "offset": 0,
"county": location_info.get("county"),
"state_code": location_info.get("state_code"),
"postal_code": location_info.get("postal_code"),
"offset": offset,
} }
total = self.handle_area(search_variables, return_total=True) search_type = "comps" if self.radius and location_type == "address" else "address" if location_type == "address" and not self.radius else "area"
if location_type == "address":
if not self.radius: #: single address search, non comps
property_id = location_info["mpr_id"]
search_variables |= {"property_id": property_id}
gql_results = self.general_search(search_variables, search_type=search_type)
if gql_results["total"] == 0:
listing_id = self.get_latest_listing_id(property_id)
if listing_id is None:
return self.handle_address(property_id)
else:
return self.handle_listing(listing_id)
else:
return gql_results["properties"]
else: #: general search, comps (radius)
coordinates = list(location_info["centroid"].values())
search_variables |= {
"coordinates": coordinates,
"radius": "{}mi".format(self.radius),
}
else: #: general search, location
search_variables |= {
"city": location_info.get("city"),
"county": location_info.get("county"),
"state_code": location_info.get("state_code"),
"postal_code": location_info.get("postal_code"),
}
result = self.general_search(search_variables, search_type=search_type)
total = result["total"]
homes = result["properties"]
homes = []
with ThreadPoolExecutor(max_workers=10) as executor: with ThreadPoolExecutor(max_workers=10) as executor:
futures = [ futures = [
executor.submit( executor.submit(
self.handle_area, self.general_search,
variables=search_variables | {"offset": i}, variables=search_variables | {"offset": i},
return_total=False, search_type=search_type,
) )
for i in range(0, total, 200) for i in range(200, min(total, 10000), 200)
] ]
for future in as_completed(futures): for future in as_completed(futures):
homes.extend(future.result()) homes.extend(future.result()["properties"])
return homes return homes
@staticmethod
def _parse_neighborhoods(result: dict) -> Optional[str]:
neighborhoods_list = []
neighborhoods = result["location"].get("neighborhoods", [])
if neighborhoods:
for neighborhood in neighborhoods:
name = neighborhood.get("name")
if name:
neighborhoods_list.append(name)
return ", ".join(neighborhoods_list) if neighborhoods_list else None
@staticmethod
def _parse_address(result: dict, search_type):
if search_type == "general_search":
return Address(
street=f"{result['location']['address']['street_number']} {result['location']['address']['street_name']} {result['location']['address']['street_suffix']}",
unit=result["location"]["address"]["unit"],
city=result["location"]["address"]["city"],
state=result["location"]["address"]["state_code"],
zip=result["location"]["address"]["postal_code"],
)
return Address(
street=f"{result['address']['street_number']} {result['address']['street_name']} {result['address']['street_suffix']}",
unit=result["address"]["unit"],
city=result["address"]["city"],
state=result["address"]["state_code"],
zip=result["address"]["postal_code"],
)
@staticmethod
def _parse_description(result: dict) -> Description:
description_data = result.get("description", {})
return Description(
style=description_data.get("type", "").upper(),
beds=description_data.get("beds"),
baths_full=description_data.get("baths_full"),
baths_half=description_data.get("baths_half"),
sqft=description_data.get("sqft"),
lot_sqft=description_data.get("lot_sqft"),
sold_price=description_data.get("sold_price"),
year_built=description_data.get("year_built"),
garage=description_data.get("garage"),
stories=description_data.get("stories"),
)

View File

@ -1,246 +0,0 @@
"""
homeharvest.redfin.__init__
~~~~~~~~~~~~
This module implements the scraper for redfin.com
"""
import json
from typing import Any
from .. import Scraper
from ....utils import parse_address_two, parse_address_one
from ..models import Property, Address, PropertyType, ListingType, SiteName, Agent
from ....exceptions import NoResultsFound, SearchTooBroad
from datetime import datetime
class RedfinScraper(Scraper):
def __init__(self, scraper_input):
super().__init__(scraper_input)
self.listing_type = scraper_input.listing_type
def _handle_location(self):
url = "https://www.redfin.com/stingray/do/location-autocomplete?v=2&al=1&location={}".format(self.location)
response = self.session.get(url)
response_json = json.loads(response.text.replace("{}&&", ""))
def get_region_type(match_type: str):
if match_type == "4":
return "2" #: zip
elif match_type == "2":
return "6" #: city
elif match_type == "1":
return "address" #: address, needs to be handled differently
elif match_type == "11":
return "state"
if "exactMatch" not in response_json["payload"]:
raise NoResultsFound("No results found for location: {}".format(self.location))
if response_json["payload"]["exactMatch"] is not None:
target = response_json["payload"]["exactMatch"]
else:
target = response_json["payload"]["sections"][0]["rows"][0]
return target["id"].split("_")[1], get_region_type(target["type"])
def _parse_home(self, home: dict, single_search: bool = False) -> Property:
def get_value(key: str) -> Any | None:
if key in home and "value" in home[key]:
return home[key]["value"]
if not single_search:
address = Address(
address_one=parse_address_one(get_value("streetLine"))[0],
address_two=parse_address_one(get_value("streetLine"))[1],
city=home.get("city"),
state=home.get("state"),
zip_code=home.get("zip"),
)
else:
address_info = home.get("streetAddress")
address_one, address_two = parse_address_one(address_info.get("assembledAddress"))
address = Address(
address_one=address_one,
address_two=address_two,
city=home.get("city"),
state=home.get("state"),
zip_code=home.get("zip"),
)
url = "https://www.redfin.com{}".format(home["url"])
lot_size_data = home.get("lotSize")
if not isinstance(lot_size_data, int):
lot_size = lot_size_data.get("value", None) if isinstance(lot_size_data, dict) else None
else:
lot_size = lot_size_data
lat_long = get_value("latLong")
return Property(
site_name=self.site_name,
listing_type=self.listing_type,
address=address,
property_url=url,
beds_min=home["beds"] if "beds" in home else None,
beds_max=home["beds"] if "beds" in home else None,
baths_min=home["baths"] if "baths" in home else None,
baths_max=home["baths"] if "baths" in home else None,
price_min=get_value("price"),
price_max=get_value("price"),
sqft_min=get_value("sqFt"),
sqft_max=get_value("sqFt"),
stories=home["stories"] if "stories" in home else None,
agent=Agent( #: listingAgent, some have sellingAgent as well
name=home['listingAgent'].get('name') if 'listingAgent' in home else None,
phone=home['listingAgent'].get('phone') if 'listingAgent' in home else None,
),
description=home["listingRemarks"] if "listingRemarks" in home else None,
year_built=get_value("yearBuilt") if not single_search else home.get("yearBuilt"),
lot_area_value=lot_size,
property_type=PropertyType.from_int_code(home.get("propertyType")),
price_per_sqft=get_value("pricePerSqFt") if type(home.get("pricePerSqFt")) != int else home.get("pricePerSqFt"),
mls_id=get_value("mlsId"),
latitude=lat_long.get('latitude') if lat_long else None,
longitude=lat_long.get('longitude') if lat_long else None,
sold_date=datetime.fromtimestamp(home['soldDate'] / 1000) if 'soldDate' in home else None,
days_on_market=get_value("dom")
)
def _handle_rentals(self, region_id, region_type):
url = f"https://www.redfin.com/stingray/api/v1/search/rentals?al=1&isRentals=true&region_id={region_id}&region_type={region_type}&num_homes=100000"
response = self.session.get(url)
response.raise_for_status()
homes = response.json()
properties_list = []
for home in homes["homes"]:
home_data = home["homeData"]
rental_data = home["rentalExtension"]
property_url = f"https://www.redfin.com{home_data.get('url', '')}"
address_info = home_data.get("addressInfo", {})
centroid = address_info.get("centroid", {}).get("centroid", {})
address = Address(
address_one=parse_address_one(address_info.get("formattedStreetLine"))[0],
city=address_info.get("city"),
state=address_info.get("state"),
zip_code=address_info.get("zip"),
)
price_range = rental_data.get("rentPriceRange", {"min": None, "max": None})
bed_range = rental_data.get("bedRange", {"min": None, "max": None})
bath_range = rental_data.get("bathRange", {"min": None, "max": None})
sqft_range = rental_data.get("sqftRange", {"min": None, "max": None})
property_ = Property(
property_url=property_url,
site_name=SiteName.REDFIN,
listing_type=ListingType.FOR_RENT,
address=address,
description=rental_data.get("description"),
latitude=centroid.get("latitude"),
longitude=centroid.get("longitude"),
baths_min=bath_range.get("min"),
baths_max=bath_range.get("max"),
beds_min=bed_range.get("min"),
beds_max=bed_range.get("max"),
price_min=price_range.get("min"),
price_max=price_range.get("max"),
sqft_min=sqft_range.get("min"),
sqft_max=sqft_range.get("max"),
img_src=home_data.get("staticMapUrl"),
posted_time=rental_data.get("lastUpdated"),
bldg_name=rental_data.get("propertyName"),
)
properties_list.append(property_)
if not properties_list:
raise NoResultsFound("No rentals found for the given location.")
return properties_list
def _parse_building(self, building: dict) -> Property:
street_address = " ".join(
[
building["address"]["streetNumber"],
building["address"]["directionalPrefix"],
building["address"]["streetName"],
building["address"]["streetType"],
]
)
return Property(
site_name=self.site_name,
property_type=PropertyType("BUILDING"),
address=Address(
address_one=parse_address_one(street_address)[0],
city=building["address"]["city"],
state=building["address"]["stateOrProvinceCode"],
zip_code=building["address"]["postalCode"],
address_two=parse_address_two(
" ".join(
[
building["address"]["unitType"],
building["address"]["unitValue"],
]
)
),
),
property_url="https://www.redfin.com{}".format(building["url"]),
listing_type=self.listing_type,
unit_count=building.get("numUnitsForSale"),
)
def handle_address(self, home_id: str):
"""
EPs:
https://www.redfin.com/stingray/api/home/details/initialInfo?al=1&path=/TX/Austin/70-Rainey-St-78701/unit-1608/home/147337694
https://www.redfin.com/stingray/api/home/details/mainHouseInfoPanelInfo?propertyId=147337694&accessLevel=3
https://www.redfin.com/stingray/api/home/details/aboveTheFold?propertyId=147337694&accessLevel=3
https://www.redfin.com/stingray/api/home/details/belowTheFold?propertyId=147337694&accessLevel=3
"""
url = "https://www.redfin.com/stingray/api/home/details/aboveTheFold?propertyId={}&accessLevel=3".format(
home_id
)
response = self.session.get(url)
response_json = json.loads(response.text.replace("{}&&", ""))
parsed_home = self._parse_home(response_json["payload"]["addressSectionInfo"], single_search=True)
return [parsed_home]
def search(self):
region_id, region_type = self._handle_location()
if region_type == "state":
raise SearchTooBroad("State searches are not supported, please use a more specific location.")
if region_type == "address":
home_id = region_id
return self.handle_address(home_id)
if self.listing_type == ListingType.FOR_RENT:
return self._handle_rentals(region_id, region_type)
else:
if self.listing_type == ListingType.FOR_SALE:
url = f"https://www.redfin.com/stingray/api/gis?al=1&region_id={region_id}&region_type={region_type}&num_homes=100000"
else:
url = f"https://www.redfin.com/stingray/api/gis?al=1&region_id={region_id}&region_type={region_type}&sold_within_days=30&num_homes=100000"
response = self.session.get(url)
response_json = json.loads(response.text.replace("{}&&", ""))
if "payload" in response_json:
homes_list = response_json["payload"].get("homes", [])
buildings_list = response_json["payload"].get("buildings", {}).values()
homes = [self._parse_home(home) for home in homes_list] + [
self._parse_building(building) for building in buildings_list
]
return homes
else:
return []

View File

@ -1,335 +0,0 @@
"""
homeharvest.zillow.__init__
~~~~~~~~~~~~
This module implements the scraper for zillow.com
"""
import re
import json
import tls_client
from .. import Scraper
from requests.exceptions import HTTPError
from ....utils import parse_address_one, parse_address_two
from ....exceptions import GeoCoordsNotFound, NoResultsFound
from ..models import Property, Address, ListingType, PropertyType, Agent
import urllib.parse
from datetime import datetime, timedelta
class ZillowScraper(Scraper):
def __init__(self, scraper_input):
session = tls_client.Session(
client_identifier="chrome112", random_tls_extension_order=True
)
super().__init__(scraper_input, session)
self.session.headers.update({
'authority': 'www.zillow.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'en-US,en;q=0.9',
'cache-control': 'max-age=0',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'same-origin',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
})
if not self.is_plausible_location(self.location):
raise NoResultsFound("Invalid location input: {}".format(self.location))
listing_type_to_url_path = {
ListingType.FOR_SALE: "for_sale",
ListingType.FOR_RENT: "for_rent",
ListingType.SOLD: "recently_sold",
}
self.url = f"https://www.zillow.com/homes/{listing_type_to_url_path[self.listing_type]}/{self.location}_rb/"
def is_plausible_location(self, location: str) -> bool:
url = (
"https://www.zillowstatic.com/autocomplete/v3/suggestions?q={"
"}&abKey=6666272a-4b99-474c-b857-110ec438732b&clientId=homepage-render"
).format(urllib.parse.quote(location))
resp = self.session.get(url)
return resp.json()["results"] != []
def search(self):
resp = self.session.get(self.url)
if resp.status_code != 200:
raise HTTPError(
f"bad response status code: {resp.status_code}"
)
content = resp.text
match = re.search(
r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
content,
re.DOTALL,
)
if not match:
raise NoResultsFound("No results were found for Zillow with the given Location.")
json_str = match.group(1)
data = json.loads(json_str)
if "searchPageState" in data["props"]["pageProps"]:
pattern = r'window\.mapBounds = \{\s*"west":\s*(-?\d+\.\d+),\s*"east":\s*(-?\d+\.\d+),\s*"south":\s*(-?\d+\.\d+),\s*"north":\s*(-?\d+\.\d+)\s*\};'
match = re.search(pattern, content)
if match:
coords = [float(coord) for coord in match.groups()]
return self._fetch_properties_backend(coords)
else:
raise GeoCoordsNotFound("Box bounds could not be located.")
elif "gdpClientCache" in data["props"]["pageProps"]:
gdp_client_cache = json.loads(data["props"]["pageProps"]["gdpClientCache"])
main_key = list(gdp_client_cache.keys())[0]
property_data = gdp_client_cache[main_key]["property"]
property = self._get_single_property_page(property_data)
return [property]
raise NoResultsFound("Specific property data not found in the response.")
def _fetch_properties_backend(self, coords):
url = "https://www.zillow.com/async-create-search-page-state"
filter_state_for_sale = {
"sortSelection": {
# "value": "globalrelevanceex"
"value": "days"
},
"isAllHomes": {"value": True},
}
filter_state_for_rent = {
"isForRent": {"value": True},
"isForSaleByAgent": {"value": False},
"isForSaleByOwner": {"value": False},
"isNewConstruction": {"value": False},
"isComingSoon": {"value": False},
"isAuction": {"value": False},
"isForSaleForeclosure": {"value": False},
"isAllHomes": {"value": True},
}
filter_state_sold = {
"isRecentlySold": {"value": True},
"isForSaleByAgent": {"value": False},
"isForSaleByOwner": {"value": False},
"isNewConstruction": {"value": False},
"isComingSoon": {"value": False},
"isAuction": {"value": False},
"isForSaleForeclosure": {"value": False},
"isAllHomes": {"value": True},
}
selected_filter = (
filter_state_for_rent
if self.listing_type == ListingType.FOR_RENT
else filter_state_for_sale
if self.listing_type == ListingType.FOR_SALE
else filter_state_sold
)
payload = {
"searchQueryState": {
"pagination": {},
"isMapVisible": True,
"mapBounds": {
"west": coords[0],
"east": coords[1],
"south": coords[2],
"north": coords[3],
},
"filterState": selected_filter,
"isListVisible": True,
"mapZoom": 11,
},
"wants": {"cat1": ["mapResults"]},
"isDebugRequest": False,
}
resp = self.session.put(url, json=payload)
if resp.status_code != 200:
raise HTTPError(
f"bad response status code: {resp.status_code}"
)
return self._parse_properties(resp.json())
@staticmethod
def parse_posted_time(time: str) -> datetime:
int_time = int(time.split(" ")[0])
if "hour" in time:
return datetime.now() - timedelta(hours=int_time)
if "day" in time:
return datetime.now() - timedelta(days=int_time)
def _parse_properties(self, property_data: dict):
mapresults = property_data["cat1"]["searchResults"]["mapResults"]
properties_list = []
for result in mapresults:
if "hdpData" in result:
home_info = result["hdpData"]["homeInfo"]
address_data = {
"address_one": parse_address_one(home_info.get("streetAddress"))[0],
"address_two": parse_address_two(home_info["unit"]) if "unit" in home_info else "#",
"city": home_info.get("city"),
"state": home_info.get("state"),
"zip_code": home_info.get("zipcode"),
}
property_obj = Property(
site_name=self.site_name,
address=Address(**address_data),
property_url=f"https://www.zillow.com{result['detailUrl']}",
tax_assessed_value=int(home_info["taxAssessedValue"]) if "taxAssessedValue" in home_info else None,
property_type=PropertyType(home_info.get("homeType")),
listing_type=ListingType(
home_info["statusType"] if "statusType" in home_info else self.listing_type
),
status_text=result.get("statusText"),
posted_time=self.parse_posted_time(result["variableData"]["text"])
if "variableData" in result
and "text" in result["variableData"]
and result["variableData"]["type"] == "TIME_ON_INFO"
else None,
price_min=home_info.get("price"),
price_max=home_info.get("price"),
beds_min=int(home_info["bedrooms"]) if "bedrooms" in home_info else None,
beds_max=int(home_info["bedrooms"]) if "bedrooms" in home_info else None,
baths_min=home_info.get("bathrooms"),
baths_max=home_info.get("bathrooms"),
sqft_min=int(home_info["livingArea"]) if "livingArea" in home_info else None,
sqft_max=int(home_info["livingArea"]) if "livingArea" in home_info else None,
price_per_sqft=int(home_info["price"] // home_info["livingArea"])
if "livingArea" in home_info and home_info["livingArea"] != 0 and "price" in home_info
else None,
latitude=result["latLong"]["latitude"],
longitude=result["latLong"]["longitude"],
lot_area_value=round(home_info["lotAreaValue"], 2) if "lotAreaValue" in home_info else None,
lot_area_unit=home_info.get("lotAreaUnit"),
img_src=result.get("imgSrc"),
)
properties_list.append(property_obj)
elif "isBuilding" in result:
price_string = result["price"].replace("$", "").replace(",", "").replace("+/mo", "")
match = re.search(r"(\d+)", price_string)
price_value = int(match.group(1)) if match else None
building_obj = Property(
property_url=f"https://www.zillow.com{result['detailUrl']}",
site_name=self.site_name,
property_type=PropertyType("BUILDING"),
listing_type=ListingType(result["statusType"]),
img_src=result.get("imgSrc"),
address=self._extract_address(result["address"]),
baths_min=result.get("minBaths"),
area_min=result.get("minArea"),
bldg_name=result.get("communityName"),
status_text=result.get("statusText"),
price_min=price_value if "+/mo" in result.get("price") else None,
price_max=price_value if "+/mo" in result.get("price") else None,
latitude=result.get("latLong", {}).get("latitude"),
longitude=result.get("latLong", {}).get("longitude"),
unit_count=result.get("unitCount"),
)
properties_list.append(building_obj)
return properties_list
def _get_single_property_page(self, property_data: dict):
"""
This method is used when a user enters the exact location & zillow returns just one property
"""
url = (
f"https://www.zillow.com{property_data['hdpUrl']}"
if "zillow.com" not in property_data["hdpUrl"]
else property_data["hdpUrl"]
)
address_data = property_data["address"]
address_one, address_two = parse_address_one(address_data["streetAddress"])
address = Address(
address_one=address_one,
address_two=address_two if address_two else "#",
city=address_data["city"],
state=address_data["state"],
zip_code=address_data["zipcode"],
)
property_type = property_data.get("homeType", None)
return Property(
site_name=self.site_name,
property_url=url,
property_type=PropertyType(property_type) if property_type in PropertyType.__members__ else None,
listing_type=self.listing_type,
address=address,
year_built=property_data.get("yearBuilt"),
tax_assessed_value=property_data.get("taxAssessedValue"),
lot_area_value=property_data.get("lotAreaValue"),
lot_area_unit=property_data["lotAreaUnits"].lower() if "lotAreaUnits" in property_data else None,
agent=Agent(
name=property_data.get("attributionInfo", {}).get("agentName")
),
stories=property_data.get("resoFacts", {}).get("stories"),
mls_id=property_data.get("attributionInfo", {}).get("mlsId"),
beds_min=property_data.get("bedrooms"),
beds_max=property_data.get("bedrooms"),
baths_min=property_data.get("bathrooms"),
baths_max=property_data.get("bathrooms"),
price_min=property_data.get("price"),
price_max=property_data.get("price"),
sqft_min=property_data.get("livingArea"),
sqft_max=property_data.get("livingArea"),
price_per_sqft=property_data.get("resoFacts", {}).get("pricePerSquareFoot"),
latitude=property_data.get("latitude"),
longitude=property_data.get("longitude"),
img_src=property_data.get("streetViewTileImageUrlMediumAddress"),
description=property_data.get("description"),
)
def _extract_address(self, address_str):
"""
Extract address components from a string formatted like '555 Wedglea Dr, Dallas, TX',
and return an Address object.
"""
parts = address_str.split(", ")
if len(parts) != 3:
raise ValueError(f"Unexpected address format: {address_str}")
address_one = parts[0].strip()
city = parts[1].strip()
state_zip = parts[2].split(" ")
if len(state_zip) == 1:
state = state_zip[0].strip()
zip_code = None
elif len(state_zip) == 2:
state = state_zip[0].strip()
zip_code = state_zip[1].strip()
else:
raise ValueError(f"Unexpected state/zip format in address: {address_str}")
address_one, address_two = parse_address_one(address_one)
return Address(
address_one=address_one,
address_two=address_two if address_two else "#",
city=city,
state=state,
zip_code=zip_code,
)

View File

@ -1,18 +1,6 @@
class InvalidSite(Exception):
"""Raised when a provided site is does not exist."""
class InvalidListingType(Exception): class InvalidListingType(Exception):
"""Raised when a provided listing type is does not exist.""" """Raised when a provided listing type is does not exist."""
class NoResultsFound(Exception): class NoResultsFound(Exception):
"""Raised when no results are found for the given location""" """Raised when no results are found for the given location"""
class GeoCoordsNotFound(Exception):
"""Raised when no property is found for the given address"""
class SearchTooBroad(Exception):
"""Raised when the search is too broad"""

View File

@ -1,38 +1,71 @@
import re from .core.scrapers.models import Property, ListingType
import pandas as pd
from .exceptions import InvalidListingType
ordered_properties = [
"property_url",
"mls",
"mls_id",
"status",
"style",
"street",
"unit",
"city",
"state",
"zip_code",
"beds",
"full_baths",
"half_baths",
"sqft",
"year_built",
"list_price",
"list_date",
"sold_price",
"last_sold_date",
"lot_sqft",
"price_per_sqft",
"latitude",
"longitude",
"stories",
"hoa_fee",
"parking_garage",
]
def parse_address_one(street_address: str) -> tuple: def process_result(result: Property) -> pd.DataFrame:
if not street_address: prop_data = {prop: None for prop in ordered_properties}
return street_address, "#" prop_data.update(result.__dict__)
apt_match = re.search( if "address" in prop_data:
r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+|SUITE\s*[\dA-Z]+)$", address_data = prop_data["address"]
street_address, prop_data["street"] = address_data.street
re.I, prop_data["unit"] = address_data.unit
) prop_data["city"] = address_data.city
prop_data["state"] = address_data.state
prop_data["zip_code"] = address_data.zip
if apt_match: prop_data["price_per_sqft"] = prop_data["prc_sqft"]
apt_str = apt_match.group().strip()
cleaned_apt_str = re.sub(r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", "#", apt_str, flags=re.I)
main_address = street_address.replace(apt_str, "").strip() description = result.description
return main_address, cleaned_apt_str prop_data["style"] = description.style
else: prop_data["beds"] = description.beds
return street_address, "#" prop_data["full_baths"] = description.baths_full
prop_data["half_baths"] = description.baths_half
prop_data["sqft"] = description.sqft
prop_data["lot_sqft"] = description.lot_sqft
prop_data["sold_price"] = description.sold_price
prop_data["year_built"] = description.year_built
prop_data["parking_garage"] = description.garage
prop_data["stories"] = description.stories
properties_df = pd.DataFrame([prop_data])
properties_df = properties_df.reindex(columns=ordered_properties)
return properties_df[ordered_properties]
def parse_address_two(street_address: str): def validate_input(listing_type: str) -> None:
if not street_address: if listing_type.upper() not in ListingType.__members__:
return "#" raise InvalidListingType(
apt_match = re.search( f"Provided listing type, '{listing_type}', does not exist."
r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+|SUITE\s*[\dA-Z]+)$", )
street_address,
re.I,
)
if apt_match:
apt_str = apt_match.group().strip()
apt_str = re.sub(r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", "#", apt_str, flags=re.I)
return apt_str
else:
return "#"

View File

@ -1,6 +1,6 @@
[tool.poetry] [tool.poetry]
name = "homeharvest" name = "homeharvest"
version = "0.2.19" version = "0.3.0"
description = "Real estate scraping library supporting Zillow, Realtor.com & Redfin." description = "Real estate scraping library supporting Zillow, Realtor.com & Redfin."
authors = ["Zachary Hampton <zachary@zacharysproducts.com>", "Cullen Watson <cullen@cullen.ai>"] authors = ["Zachary Hampton <zachary@zacharysproducts.com>", "Cullen Watson <cullen@cullen.ai>"]
homepage = "https://github.com/ZacharyHampton/HomeHarvest" homepage = "https://github.com/ZacharyHampton/HomeHarvest"

View File

@ -1,26 +1,78 @@
from homeharvest import scrape_property from homeharvest import scrape_property
from homeharvest.exceptions import ( from homeharvest.exceptions import (
InvalidSite,
InvalidListingType, InvalidListingType,
NoResultsFound, NoResultsFound,
GeoCoordsNotFound,
) )
def test_realtor_pending_or_contingent():
pending_or_contingent_result = scrape_property(
location="Surprise, AZ",
pending_or_contingent=True,
)
regular_result = scrape_property(
location="Surprise, AZ",
pending_or_contingent=False,
)
assert all([result is not None for result in [pending_or_contingent_result, regular_result]])
assert len(pending_or_contingent_result) != len(regular_result)
def test_realtor_comps():
result = scrape_property(
location="2530 Al Lipscomb Way",
radius=0.5,
property_younger_than=180,
listing_type="sold",
)
assert result is not None and len(result) > 0
def test_realtor_last_x_days_sold():
days_result_30 = scrape_property(
location="Dallas, TX", listing_type="sold", property_younger_than=30
)
days_result_10 = scrape_property(
location="Dallas, TX", listing_type="sold", property_younger_than=10
)
assert all(
[result is not None for result in [days_result_30, days_result_10]]
) and len(days_result_30) != len(days_result_10)
def test_realtor_single_property():
results = [
scrape_property(
location="15509 N 172nd Dr, Surprise, AZ 85388",
listing_type="for_sale",
),
scrape_property(
location="2530 Al Lipscomb Way",
listing_type="for_sale",
),
]
assert all([result is not None for result in results])
def test_realtor(): def test_realtor():
results = [ results = [
scrape_property( scrape_property(
location="2530 Al Lipscomb Way", location="2530 Al Lipscomb Way",
site_name="realtor.com",
listing_type="for_sale", listing_type="for_sale",
), ),
scrape_property( scrape_property(
location="Phoenix, AZ", site_name=["realtor.com"], listing_type="for_rent" location="Phoenix, AZ", listing_type="for_rent"
), #: does not support "city, state, USA" format ), #: does not support "city, state, USA" format
scrape_property( scrape_property(
location="Dallas, TX", site_name="realtor.com", listing_type="sold" location="Dallas, TX", listing_type="sold"
), #: does not support "city, state, USA" format ), #: does not support "city, state, USA" format
scrape_property(location="85281", site_name="realtor.com"), scrape_property(location="85281"),
] ]
assert all([result is not None for result in results]) assert all([result is not None for result in results])
@ -30,11 +82,10 @@ def test_realtor():
bad_results += [ bad_results += [
scrape_property( scrape_property(
location="abceefg ju098ot498hh9", location="abceefg ju098ot498hh9",
site_name="realtor.com",
listing_type="for_sale", listing_type="for_sale",
) )
] ]
except (InvalidSite, InvalidListingType, NoResultsFound, GeoCoordsNotFound): except (InvalidListingType, NoResultsFound):
assert True assert True
assert all([result is None for result in bad_results]) assert all([result is None for result in bad_results])

View File

@ -1,35 +0,0 @@
from homeharvest import scrape_property
from homeharvest.exceptions import (
InvalidSite,
InvalidListingType,
NoResultsFound,
GeoCoordsNotFound,
SearchTooBroad,
)
def test_redfin():
results = [
scrape_property(location="San Diego", site_name="redfin", listing_type="for_sale"),
scrape_property(location="2530 Al Lipscomb Way", site_name="redfin", listing_type="for_sale"),
scrape_property(location="Phoenix, AZ, USA", site_name=["redfin"], listing_type="for_rent"),
scrape_property(location="Dallas, TX, USA", site_name="redfin", listing_type="sold"),
scrape_property(location="85281", site_name="redfin"),
]
assert all([result is not None for result in results])
bad_results = []
try:
bad_results += [
scrape_property(
location="abceefg ju098ot498hh9",
site_name="redfin",
listing_type="for_sale",
),
scrape_property(location="Florida", site_name="redfin", listing_type="for_rent"),
]
except (InvalidSite, InvalidListingType, NoResultsFound, GeoCoordsNotFound, SearchTooBroad):
assert True
assert all([result is None for result in bad_results])

View File

@ -1,24 +0,0 @@
from homeharvest.utils import parse_address_one, parse_address_two
def test_parse_address_one():
test_data = [
("4303 E Cactus Rd Apt 126", ("4303 E Cactus Rd", "#126")),
("1234 Elm Street apt 2B", ("1234 Elm Street", "#2B")),
("1234 Elm Street UNIT 3A", ("1234 Elm Street", "#3A")),
("1234 Elm Street unit 3A", ("1234 Elm Street", "#3A")),
("1234 Elm Street SuIte 3A", ("1234 Elm Street", "#3A")),
]
for input_data, (exp_addr_one, exp_addr_two) in test_data:
address_one, address_two = parse_address_one(input_data)
assert address_one == exp_addr_one
assert address_two == exp_addr_two
def test_parse_address_two():
test_data = [("Apt 126", "#126"), ("apt 2B", "#2B"), ("UNIT 3A", "#3A"), ("unit 3A", "#3A"), ("SuIte 3A", "#3A")]
for input_data, expected in test_data:
output = parse_address_two(input_data)
assert output == expected

View File

@ -1,34 +0,0 @@
from homeharvest import scrape_property
from homeharvest.exceptions import (
InvalidSite,
InvalidListingType,
NoResultsFound,
GeoCoordsNotFound,
)
def test_zillow():
results = [
scrape_property(location="2530 Al Lipscomb Way", site_name="zillow", listing_type="for_sale"),
scrape_property(location="Phoenix, AZ, USA", site_name=["zillow"], listing_type="for_rent"),
scrape_property(location="Surprise, AZ", site_name=["zillow"], listing_type="for_sale"),
scrape_property(location="Dallas, TX, USA", site_name="zillow", listing_type="sold"),
scrape_property(location="85281", site_name="zillow"),
scrape_property(location="3268 88th st s, Lakewood", site_name="zillow", listing_type="for_rent"),
]
assert all([result is not None for result in results])
bad_results = []
try:
bad_results += [
scrape_property(
location="abceefg ju098ot498hh9",
site_name="zillow",
listing_type="for_sale",
)
]
except (InvalidSite, InvalidListingType, NoResultsFound, GeoCoordsNotFound):
assert True
assert all([result is None for result in bad_results])