diff --git a/README.md b/README.md
index 74fade3..e7c72b1 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
-**HomeHarvest** is a simple, yet comprehensive, real estate scraping library.
+**HomeHarvest** is a simple, yet comprehensive, real estate scraping library that extracts and formats data in the style of MLS listings.
[![Try with Replit](https://replit.com/badge?caption=Try%20with%20Replit)](https://replit.com/@ZacharyHampton/HomeHarvestDemo)
@@ -11,10 +11,14 @@
Check out another project we wrote: ***[JobSpy](https://github.com/cullenwatson/JobSpy)** – a Python package for job scraping*
-## Features
+## HomeHarvest Features
-- Scrapes properties from **Zillow**, **Realtor.com** & **Redfin** simultaneously
-- Aggregates the properties in a Pandas DataFrame
+- **Source**: Fetches properties directly from **Realtor.com**.
+- **Data Format**: Structures data to resemble MLS listings.
+- **Export Flexibility**: Options to save as either CSV or Excel.
+- **Usage Modes**:
+ - **CLI**: For users who prefer command-line operations.
+ - **Python**: For those who'd like to integrate scraping into their Python scripts.
[Video Guide for HomeHarvest](https://youtu.be/JnV7eR2Ve2o) - _updated for release v0.2.7_
@@ -31,136 +35,150 @@ pip install homeharvest
### CLI
+```
+usage: homeharvest [-l {for_sale,for_rent,sold}] [-o {excel,csv}] [-f FILENAME] [-p PROXY] [-d DAYS] [-r RADIUS] [-m] location
+
+Home Harvest Property Scraper
+
+positional arguments:
+ location Location to scrape (e.g., San Francisco, CA)
+
+options:
+ -l {for_sale,for_rent,sold}, --listing_type {for_sale,for_rent,sold}
+ Listing type to scrape
+ -o {excel,csv}, --output {excel,csv}
+ Output format
+ -f FILENAME, --filename FILENAME
+ Name of the output file (without extension)
+ -p PROXY, --proxy PROXY
+ Proxy to use for scraping
+ -d DAYS, --days DAYS Sold/listed in last _ days filter.
+ -r RADIUS, --radius RADIUS
+                        Get comparable properties within _ (e.g. 0.5) miles. Only applicable for individual addresses.
+ -m, --mls_only If set, fetches only MLS listings.
+```
```bash
-homeharvest "San Francisco, CA" -s zillow realtor.com redfin -l for_rent -o excel -f HomeHarvest
+> homeharvest "San Francisco, CA" -l for_rent -o excel -f HomeHarvest
```
-This will scrape properties from the specified sites for the given location and listing type, and save the results to an Excel file named `HomeHarvest.xlsx`.
-
-By default:
-- If `-s` or `--site_name` is not provided, it will scrape from all available sites.
-- If `-l` or `--listing_type` is left blank, the default is `for_sale`. Other options are `for_rent` or `sold`.
-- The `-o` or `--output` default format is `excel`. Options are `csv` or `excel`.
-- If `-f` or `--filename` is left blank, the default is `HomeHarvest_`.
-- If `-p` or `--proxy` is not provided, the scraper uses the local IP.
-- Use `-k` or `--keep_duplicates` to keep duplicate properties based on address. If not provided, duplicates will be removed.
-### Python
+### Python
```py
from homeharvest import scrape_property
-import pandas as pd
+from datetime import datetime
-properties: pd.DataFrame = scrape_property(
- site_name=["zillow", "realtor.com", "redfin"],
- location="85281",
- listing_type="for_rent" # for_sale / sold
+# Generate filename based on current timestamp
+current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+filename = f"output/{current_timestamp}.csv"
+
+properties = scrape_property(
+ location="San Diego, CA",
+    listing_type="sold",  # or "for_sale", "for_rent"
+    property_younger_than=30,  # sold in last 30 days (listed in last 30 days for for_sale / for_rent)
+ mls_only=True, # only fetch MLS listings
)
+print(f"Number of properties: {len(properties)}")
-#: Note, to export to CSV or Excel, use properties.to_csv() or properties.to_excel().
-print(properties)
+# Export to csv
+properties.to_csv(filename, index=False)
+print(properties.head())
```
## Output
-```py
->>> properties.head()
- property_url site_name listing_type apt_min_price apt_max_price ...
-0 https://www.redfin.com/AZ/Tempe/1003-W-Washing... redfin for_rent 1666.0 2750.0 ...
-1 https://www.redfin.com/AZ/Tempe/VELA-at-Town-L... redfin for_rent 1665.0 3763.0 ...
-2 https://www.redfin.com/AZ/Tempe/Camden-Tempe/a... redfin for_rent 1939.0 3109.0 ...
-3 https://www.redfin.com/AZ/Tempe/Emerson-Park/a... redfin for_rent 1185.0 1817.0 ...
-4 https://www.redfin.com/AZ/Tempe/Rio-Paradiso-A... redfin for_rent 1470.0 2235.0 ...
-[5 rows x 41 columns]
-```
-
-### Parameters for `scrape_properties()`
```plaintext
-Required
-├── location (str): address in various formats e.g. just zip, full address, city/state, etc.
-└── listing_type (enum): for_rent, for_sale, sold
-Optional
-├── site_name (list[enum], default=all three sites): zillow, realtor.com, redfin
-├── proxy (str): in format 'http://user:pass@host:port' or [https, socks]
-└── keep_duplicates (bool, default=False): whether to keep or remove duplicate properties based on address
+>>> properties.head()
+ MLS MLS # Status Style ... COEDate LotSFApx PrcSqft Stories
+0 SDCA 230018348 SOLD CONDOS ... 2023-10-03 290110 803 2
+1 SDCA 230016614 SOLD TOWNHOMES ... 2023-10-03 None 838 3
+2 SDCA 230016367 SOLD CONDOS ... 2023-10-03 30056 649 1
+3 MRCA NDP2306335 SOLD SINGLE_FAMILY ... 2023-10-03 7519 661 2
+4 SDCA 230014532 SOLD CONDOS ... 2023-10-03 None 752 1
+[5 rows x 22 columns]
```
+### Parameters for `scrape_property()`
+```
+Required
+├── location (str): The address in various formats, e.g. a zip code, a full address, or city/state.
+└── listing_type (str): Choose the type of listing.
+ - 'for_rent'
+ - 'for_sale'
+ - 'sold'
+
+Optional
+├── radius (decimal): Radius in miles to find comparable properties based on individual addresses.
+│ Example: 5.5 (fetches properties within a 5.5-mile radius if location is set to a specific address; otherwise, ignored)
+│
+├── property_younger_than (integer): Number of past days to filter properties. Utilizes 'last_sold_date' for 'sold' listing types, and 'list_date' for others (for_rent, for_sale).
+│ Example: 30 (fetches properties listed/sold in the last 30 days)
+│
+├── mls_only (True/False): If set, fetches only MLS listings (mainly applicable to 'sold' listings)
+│
+└── proxy (string): In format 'http://user:pass@host:port'
+
+```
### Property Schema
```plaintext
Property
├── Basic Information:
-│ ├── property_url (str)
-│ ├── site_name (enum): zillow, redfin, realtor.com
-│ ├── listing_type (enum): for_sale, for_rent, sold
-│ └── property_type (enum): house, apartment, condo, townhouse, single_family, multi_family, building
+│ ├── property_url
+│ ├── mls
+│ ├── mls_id
+│ └── status
├── Address Details:
-│ ├── street_address (str)
-│ ├── city (str)
-│ ├── state (str)
-│ ├── zip_code (str)
-│ ├── unit (str)
-│ └── country (str)
+│ ├── street
+│ ├── unit
+│ ├── city
+│ ├── state
+│ └── zip_code
-├── House for Sale Features:
-│ ├── tax_assessed_value (int)
-│ ├── lot_area_value (float)
-│ ├── lot_area_unit (str)
-│ ├── stories (int)
-│ ├── year_built (int)
-│ └── price_per_sqft (int)
+├── Property Description:
+│ ├── style
+│ ├── beds
+│ ├── full_baths
+│ ├── half_baths
+│ ├── sqft
+│ ├── year_built
+│ ├── stories
+│ └── lot_sqft
-├── Building for Sale and Apartment Details:
-│ ├── bldg_name (str)
-│ ├── beds_min (int)
-│ ├── beds_max (int)
-│ ├── baths_min (float)
-│ ├── baths_max (float)
-│ ├── sqft_min (int)
-│ ├── sqft_max (int)
-│ ├── price_min (int)
-│ ├── price_max (int)
-│ ├── area_min (int)
-│ └── unit_count (int)
+├── Property Listing Details:
+│ ├── list_price
+│ ├── list_date
+│ ├── sold_price
+│ ├── last_sold_date
+│ ├── price_per_sqft
+│ └── hoa_fee
-├── Miscellaneous Details:
-│ ├── mls_id (str)
-│ ├── agent_name (str)
-│ ├── img_src (str)
-│ ├── description (str)
-│ ├── status_text (str)
-│ └── posted_time (str)
+├── Location Details:
+│ ├── latitude
+│ ├── longitude
-└── Location Details:
- ├── latitude (float)
- └── longitude (float)
+└── Parking Details:
+ └── parking_garage
```
-## Supported Countries for Property Scraping
-
-* **Zillow**: contains listings in the **US** & **Canada**
-* **Realtor.com**: mainly from the **US** but also has international listings
-* **Redfin**: listings mainly in the **US**, **Canada**, & has expanded to some areas in **Mexico**
### Exceptions
The following exceptions may be raised when using HomeHarvest:
-- `InvalidSite` - valid options: `zillow`, `redfin`, `realtor.com`
- `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold`
-- `NoResultsFound` - no properties found from your input
-- `GeoCoordsNotFound` - if Zillow scraper is not able to derive geo-coordinates from the location you input
-
+- `NoResultsFound` - no properties found from your search
+
+
## Frequently Asked Questions
-
---
-**Q: Encountering issues with your queries?**
-**A:** Try a single site and/or broaden the location. If problems persist, [submit an issue](https://github.com/ZacharyHampton/HomeHarvest/issues).
+**Q: Encountering issues with your searches?**
+**A:** Try broadening your search parameters. If problems persist, [submit an issue](https://github.com/ZacharyHampton/HomeHarvest/issues).
---
**Q: Received a Forbidden 403 response code?**
-**A:** This indicates that you have been blocked by the real estate site for sending too many requests. Currently, **Zillow** is particularly aggressive with blocking. We recommend:
+**A:** This indicates that you have been blocked by Realtor.com for sending too many requests. We recommend:
- Waiting a few seconds between requests.
-- Trying a VPN to change your IP address.
+- Trying a VPN or passing a proxy via the `proxy` parameter of `scrape_property()` to change your IP address.
---
diff --git a/HomeHarvest_Demo.ipynb b/examples/HomeHarvest_Demo.ipynb
similarity index 93%
rename from HomeHarvest_Demo.ipynb
rename to examples/HomeHarvest_Demo.ipynb
index fb9106b..43a28be 100644
--- a/HomeHarvest_Demo.ipynb
+++ b/examples/HomeHarvest_Demo.ipynb
@@ -31,7 +31,7 @@
"metadata": {},
"outputs": [],
"source": [
- "# scrapes all 3 sites by default\n",
+ "# check for sale properties\n",
"scrape_property(\n",
" location=\"dallas\",\n",
" listing_type=\"for_sale\"\n",
@@ -53,7 +53,6 @@
"# search a specific address\n",
"scrape_property(\n",
" location=\"2530 Al Lipscomb Way\",\n",
- " site_name=\"zillow\",\n",
" listing_type=\"for_sale\"\n",
")"
]
@@ -68,7 +67,6 @@
"# check rentals\n",
"scrape_property(\n",
" location=\"chicago, illinois\",\n",
- " site_name=[\"redfin\", \"zillow\"],\n",
" listing_type=\"for_rent\"\n",
")"
]
@@ -88,7 +86,6 @@
"# check sold properties\n",
"scrape_property(\n",
" location=\"90210\",\n",
- " site_name=[\"redfin\"],\n",
" listing_type=\"sold\"\n",
")"
]
diff --git a/examples/HomeHarvest_Demo.py b/examples/HomeHarvest_Demo.py
new file mode 100644
index 0000000..b7e999f
--- /dev/null
+++ b/examples/HomeHarvest_Demo.py
@@ -0,0 +1,18 @@
+from homeharvest import scrape_property
+from datetime import datetime
+
+# Generate filename based on current timestamp
+current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+filename = f"output/{current_timestamp}.csv"
+
+properties = scrape_property(
+ location="San Diego, CA",
+ listing_type="sold", # for_sale, for_rent
+ property_younger_than=30, # sold/listed in last 30 days
+ mls_only=True, # only fetch MLS listings
+)
+print(f"Number of properties: {len(properties)}")
+
+# Export to csv
+properties.to_csv(filename, index=False)
+print(properties.head())
\ No newline at end of file
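One caveat with the demo script above: it writes to `output/<timestamp>.csv`, and pandas' `to_csv` raises if that directory does not exist. A small sketch of a safer filename helper (the name `timestamped_csv_path` is made up for illustration):

```python
from datetime import datetime
from pathlib import Path

def timestamped_csv_path(base_dir: str = "output") -> Path:
    """Build an output/<timestamp>.csv path, creating the
    directory first so DataFrame.to_csv does not fail."""
    Path(base_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(base_dir) / f"{stamp}.csv"

path = timestamped_csv_path()
print(path.suffix)  # .csv
```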
diff --git a/homeharvest/__init__.py b/homeharvest/__init__.py
index 2b60e3b..63aa13f 100644
--- a/homeharvest/__init__.py
+++ b/homeharvest/__init__.py
@@ -1,187 +1,50 @@
+import warnings
import pandas as pd
-from typing import Union
-import concurrent.futures
-from concurrent.futures import ThreadPoolExecutor
-
from .core.scrapers import ScraperInput
-from .core.scrapers.redfin import RedfinScraper
+from .utils import process_result, ordered_properties, validate_input
from .core.scrapers.realtor import RealtorScraper
-from .core.scrapers.zillow import ZillowScraper
-from .core.scrapers.models import ListingType, Property, SiteName
-from .exceptions import InvalidSite, InvalidListingType
-
-
-_scrapers = {
- "redfin": RedfinScraper,
- "realtor.com": RealtorScraper,
- "zillow": ZillowScraper,
-}
-
-
-def _validate_input(site_name: str, listing_type: str) -> None:
- if site_name.lower() not in _scrapers:
- raise InvalidSite(f"Provided site, '{site_name}', does not exist.")
-
- if listing_type.upper() not in ListingType.__members__:
- raise InvalidListingType(f"Provided listing type, '{listing_type}', does not exist.")
-
-
-def _get_ordered_properties(result: Property) -> list[str]:
- return [
- "property_url",
- "site_name",
- "listing_type",
- "property_type",
- "status_text",
- "baths_min",
- "baths_max",
- "beds_min",
- "beds_max",
- "sqft_min",
- "sqft_max",
- "price_min",
- "price_max",
- "unit_count",
- "tax_assessed_value",
- "price_per_sqft",
- "lot_area_value",
- "lot_area_unit",
- "address_one",
- "address_two",
- "city",
- "state",
- "zip_code",
- "posted_time",
- "area_min",
- "bldg_name",
- "stories",
- "year_built",
- "agent_name",
- "agent_phone",
- "agent_email",
- "days_on_market",
- "sold_date",
- "mls_id",
- "img_src",
- "latitude",
- "longitude",
- "description",
- ]
-
-
-def _process_result(result: Property) -> pd.DataFrame:
- prop_data = result.__dict__
-
- prop_data["site_name"] = prop_data["site_name"].value
- prop_data["listing_type"] = prop_data["listing_type"].value.lower()
- if "property_type" in prop_data and prop_data["property_type"] is not None:
- prop_data["property_type"] = prop_data["property_type"].value.lower()
- else:
- prop_data["property_type"] = None
- if "address" in prop_data:
- address_data = prop_data["address"]
- prop_data["address_one"] = address_data.address_one
- prop_data["address_two"] = address_data.address_two
- prop_data["city"] = address_data.city
- prop_data["state"] = address_data.state
- prop_data["zip_code"] = address_data.zip_code
-
- del prop_data["address"]
-
- if "agent" in prop_data and prop_data["agent"] is not None:
- agent_data = prop_data["agent"]
- prop_data["agent_name"] = agent_data.name
- prop_data["agent_phone"] = agent_data.phone
- prop_data["agent_email"] = agent_data.email
-
- del prop_data["agent"]
- else:
- prop_data["agent_name"] = None
- prop_data["agent_phone"] = None
- prop_data["agent_email"] = None
-
- properties_df = pd.DataFrame([prop_data])
- properties_df = properties_df[_get_ordered_properties(result)]
-
- return properties_df
-
-
-def _scrape_single_site(location: str, site_name: str, listing_type: str, proxy: str = None) -> pd.DataFrame:
- """
- Helper function to scrape a single site.
- """
- _validate_input(site_name, listing_type)
-
- scraper_input = ScraperInput(
- location=location,
- listing_type=ListingType[listing_type.upper()],
- site_name=SiteName.get_by_value(site_name.lower()),
- proxy=proxy,
- )
-
- site = _scrapers[site_name.lower()](scraper_input)
- results = site.search()
-
- properties_dfs = [_process_result(result) for result in results]
- properties_dfs = [df.dropna(axis=1, how="all") for df in properties_dfs if not df.empty]
- if not properties_dfs:
- return pd.DataFrame()
-
- return pd.concat(properties_dfs, ignore_index=True)
+from .core.scrapers.models import ListingType
+from .exceptions import InvalidListingType, NoResultsFound
def scrape_property(
location: str,
- site_name: Union[str, list[str]] = None,
listing_type: str = "for_sale",
+ radius: float = None,
+ mls_only: bool = False,
+ property_younger_than: int = None,
+ pending_or_contingent: bool = False,
proxy: str = None,
- keep_duplicates: bool = False
) -> pd.DataFrame:
"""
- Scrape property from various sites from a given location and listing type.
-
- :returns: pd.DataFrame
- :param location: US Location (e.g. 'San Francisco, CA', 'Cook County, IL', '85281', '2530 Al Lipscomb Way')
- :param site_name: Site name or list of site names (e.g. ['realtor.com', 'zillow'], 'redfin')
- :param listing_type: Listing type (e.g. 'for_sale', 'for_rent', 'sold')
- :return: pd.DataFrame containing properties
+ Scrape properties from Realtor.com based on a given location and listing type.
+ :param location: Location to search (e.g. "Dallas, TX", "85281", "2530 Al Lipscomb Way")
+ :param listing_type: Listing Type (for_sale, for_rent, sold)
+ :param radius: Get properties within _ (e.g. 1.0) miles. Only applicable for individual addresses.
+ :param mls_only: If set, fetches only listings with MLS IDs.
+ :param property_younger_than: Get properties sold/listed in last _ days.
+ :param pending_or_contingent: If set, fetches only pending or contingent listings. Only applicable for for_sale listings from general area searches.
+ :param proxy: Proxy to use for scraping
"""
- if site_name is None:
- site_name = list(_scrapers.keys())
+ validate_input(listing_type)
- if not isinstance(site_name, list):
- site_name = [site_name]
+ scraper_input = ScraperInput(
+ location=location,
+ listing_type=ListingType[listing_type.upper()],
+ proxy=proxy,
+ radius=radius,
+ mls_only=mls_only,
+ last_x_days=property_younger_than,
+ pending_or_contingent=pending_or_contingent,
+ )
- results = []
+ site = RealtorScraper(scraper_input)
+ results = site.search()
- if len(site_name) == 1:
- final_df = _scrape_single_site(location, site_name[0], listing_type, proxy)
- results.append(final_df)
- else:
- with ThreadPoolExecutor() as executor:
- futures = {
- executor.submit(_scrape_single_site, location, s_name, listing_type, proxy): s_name
- for s_name in site_name
- }
+ properties_dfs = [process_result(result) for result in results]
+ if not properties_dfs:
+ raise NoResultsFound("no results found for the query")
- for future in concurrent.futures.as_completed(futures):
- result = future.result()
- results.append(result)
-
- results = [df for df in results if not df.empty and not df.isna().all().all()]
-
- if not results:
- return pd.DataFrame()
-
- final_df = pd.concat(results, ignore_index=True)
-
- columns_to_track = ["address_one", "address_two", "city"]
-
- #: validate they exist, otherwise create them
- for col in columns_to_track:
- if col not in final_df.columns:
- final_df[col] = None
-
- if not keep_duplicates:
- final_df = final_df.drop_duplicates(subset=columns_to_track, keep="first")
- return final_df
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore", category=FutureWarning)
+ return pd.concat(properties_dfs, ignore_index=True, axis=0)[ordered_properties]
diff --git a/homeharvest/cli.py b/homeharvest/cli.py
index c9deae8..198de12 100644
--- a/homeharvest/cli.py
+++ b/homeharvest/cli.py
@@ -5,15 +5,8 @@ from homeharvest import scrape_property
def main():
parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
- parser.add_argument("location", type=str, help="Location to scrape (e.g., San Francisco, CA)")
-
parser.add_argument(
- "-s",
- "--site_name",
- type=str,
- nargs="*",
- default=None,
- help="Site name(s) to scrape from (e.g., realtor, zillow)",
+ "location", type=str, help="Location to scrape (e.g., San Francisco, CA)"
)
parser.add_argument(
@@ -43,17 +36,40 @@ def main():
)
parser.add_argument(
- "-k",
- "--keep_duplicates",
- action="store_true",
- help="Keep duplicate properties based on address"
+ "-p", "--proxy", type=str, default=None, help="Proxy to use for scraping"
+ )
+ parser.add_argument(
+ "-d",
+ "--days",
+ type=int,
+ default=None,
+ help="Sold/listed in last _ days filter.",
)
- parser.add_argument("-p", "--proxy", type=str, default=None, help="Proxy to use for scraping")
+ parser.add_argument(
+ "-r",
+ "--radius",
+ type=float,
+ default=None,
+        help="Get comparable properties within _ (e.g. 0.5) miles. Only applicable for individual addresses.",
+ )
+ parser.add_argument(
+ "-m",
+ "--mls_only",
+ action="store_true",
+ help="If set, fetches only MLS listings.",
+ )
args = parser.parse_args()
- result = scrape_property(args.location, args.site_name, args.listing_type, proxy=args.proxy, keep_duplicates=args.keep_duplicates)
+ result = scrape_property(
+ args.location,
+ args.listing_type,
+ radius=args.radius,
+ proxy=args.proxy,
+ mls_only=args.mls_only,
+ property_younger_than=args.days,
+ )
if not args.filename:
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
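The argparse changes above can be exercised without touching the network; a stripped-down reconstruction of the new CLI surface (omitting `-o`, `-f`, and `-p` for brevity):

```python
import argparse

# Minimal reconstruction of the new flags and how they map onto
# scrape_property keyword arguments (radius, days, mls_only).
parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
parser.add_argument("location", type=str)
parser.add_argument("-l", "--listing_type", default="for_sale",
                    choices=["for_sale", "for_rent", "sold"])
parser.add_argument("-d", "--days", type=int, default=None)
parser.add_argument("-r", "--radius", type=float, default=None)
parser.add_argument("-m", "--mls_only", action="store_true")

args = parser.parse_args(["San Francisco, CA", "-l", "for_rent", "-m"])
print(args.listing_type, args.mls_only)  # for_rent True
```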
diff --git a/homeharvest/core/scrapers/__init__.py b/homeharvest/core/scrapers/__init__.py
index e900dbe..f82b321 100644
--- a/homeharvest/core/scrapers/__init__.py
+++ b/homeharvest/core/scrapers/__init__.py
@@ -8,12 +8,19 @@ from .models import Property, ListingType, SiteName
class ScraperInput:
location: str
listing_type: ListingType
- site_name: SiteName
+ radius: float | None = None
+ mls_only: bool | None = None
proxy: str | None = None
+ last_x_days: int | None = None
+ pending_or_contingent: bool | None = None
class Scraper:
- def __init__(self, scraper_input: ScraperInput, session: requests.Session | tls_client.Session = None):
+ def __init__(
+ self,
+ scraper_input: ScraperInput,
+ session: requests.Session | tls_client.Session = None,
+ ):
self.location = scraper_input.location
self.listing_type = scraper_input.listing_type
@@ -28,7 +35,10 @@ class Scraper:
self.session.proxies.update(proxies)
self.listing_type = scraper_input.listing_type
- self.site_name = scraper_input.site_name
+ self.radius = scraper_input.radius
+ self.last_x_days = scraper_input.last_x_days
+ self.mls_only = scraper_input.mls_only
+ self.pending_or_contingent = scraper_input.pending_or_contingent
def search(self) -> list[Property]:
...
diff --git a/homeharvest/core/scrapers/models.py b/homeharvest/core/scrapers/models.py
index ed75999..a8ae258 100644
--- a/homeharvest/core/scrapers/models.py
+++ b/homeharvest/core/scrapers/models.py
@@ -1,7 +1,6 @@
from dataclasses import dataclass
from enum import Enum
-from typing import Tuple
-from datetime import datetime
+from typing import Optional
class SiteName(Enum):
@@ -23,98 +22,44 @@ class ListingType(Enum):
SOLD = "SOLD"
-class PropertyType(Enum):
- HOUSE = "HOUSE"
- BUILDING = "BUILDING"
- CONDO = "CONDO"
- TOWNHOUSE = "TOWNHOUSE"
- SINGLE_FAMILY = "SINGLE_FAMILY"
- MULTI_FAMILY = "MULTI_FAMILY"
- MANUFACTURED = "MANUFACTURED"
- NEW_CONSTRUCTION = "NEW_CONSTRUCTION"
- APARTMENT = "APARTMENT"
- APARTMENTS = "APARTMENTS"
- LAND = "LAND"
- LOT = "LOT"
- OTHER = "OTHER"
-
- BLANK = "BLANK"
-
- @classmethod
- def from_int_code(cls, code):
- mapping = {
- 1: cls.HOUSE,
- 2: cls.CONDO,
- 3: cls.TOWNHOUSE,
- 4: cls.MULTI_FAMILY,
- 5: cls.LAND,
- 6: cls.OTHER,
- 8: cls.SINGLE_FAMILY,
- 13: cls.SINGLE_FAMILY,
- }
-
- return mapping.get(code, cls.BLANK)
-
-
@dataclass
class Address:
- address_one: str | None = None
- address_two: str | None = "#"
+ street: str | None = None
+ unit: str | None = None
city: str | None = None
state: str | None = None
- zip_code: str | None = None
+ zip: str | None = None
@dataclass
-class Agent:
- name: str
- phone: str | None = None
- email: str | None = None
+class Description:
+ style: str | None = None
+ beds: int | None = None
+ baths_full: int | None = None
+ baths_half: int | None = None
+ sqft: int | None = None
+ lot_sqft: int | None = None
+ sold_price: int | None = None
+ year_built: int | None = None
+ garage: float | None = None
+ stories: int | None = None
@dataclass
class Property:
property_url: str
- site_name: SiteName
- listing_type: ListingType
- address: Address
- property_type: PropertyType | None = None
-
- # house for sale
- tax_assessed_value: int | None = None
- lot_area_value: float | None = None
- lot_area_unit: str | None = None
- stories: int | None = None
- year_built: int | None = None
- price_per_sqft: int | None = None
+ mls: str | None = None
mls_id: str | None = None
+ status: str | None = None
+ address: Address | None = None
- agent: Agent | None = None
- img_src: str | None = None
- description: str | None = None
- status_text: str | None = None
- posted_time: datetime | None = None
-
- # building for sale
- bldg_name: str | None = None
- area_min: int | None = None
-
- beds_min: int | None = None
- beds_max: int | None = None
-
- baths_min: float | None = None
- baths_max: float | None = None
-
- sqft_min: int | None = None
- sqft_max: int | None = None
-
- price_min: int | None = None
- price_max: int | None = None
-
- unit_count: int | None = None
+ list_price: int | None = None
+ list_date: str | None = None
+ last_sold_date: str | None = None
+ prc_sqft: int | None = None
+ hoa_fee: int | None = None
+ description: Description | None = None
latitude: float | None = None
longitude: float | None = None
-
- sold_date: datetime | None = None
- days_on_market: int | None = None
+ neighborhoods: Optional[str] = None
diff --git a/homeharvest/core/scrapers/realtor/__init__.py b/homeharvest/core/scrapers/realtor/__init__.py
index 78ecc84..fcd96b2 100644
--- a/homeharvest/core/scrapers/realtor/__init__.py
+++ b/homeharvest/core/scrapers/realtor/__init__.py
@@ -2,39 +2,26 @@
homeharvest.realtor.__init__
~~~~~~~~~~~~
-This module implements the scraper for relator.com
+This module implements the scraper for realtor.com
"""
-from ..models import Property, Address
+from typing import Dict, Union, Optional
+from concurrent.futures import ThreadPoolExecutor, as_completed
+
from .. import Scraper
from ....exceptions import NoResultsFound
-from ....utils import parse_address_one, parse_address_two
-from concurrent.futures import ThreadPoolExecutor, as_completed
+from ..models import Property, Address, ListingType, Description
class RealtorScraper(Scraper):
+ SEARCH_GQL_URL = "https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta"
+ PROPERTY_URL = "https://www.realtor.com/realestateandhomes-detail/"
+ ADDRESS_AUTOCOMPLETE_URL = "https://parser-external.geo.moveaws.com/suggest"
+
def __init__(self, scraper_input):
self.counter = 1
super().__init__(scraper_input)
- self.search_url = (
- "https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta"
- )
def handle_location(self):
- headers = {
- "authority": "parser-external.geo.moveaws.com",
- "accept": "*/*",
- "accept-language": "en-US,en;q=0.9",
- "origin": "https://www.realtor.com",
- "referer": "https://www.realtor.com/",
- "sec-ch-ua": '"Chromium";v="116", "Not)A;Brand";v="24", "Google Chrome";v="116"',
- "sec-ch-ua-mobile": "?0",
- "sec-ch-ua-platform": '"Windows"',
- "sec-fetch-dest": "empty",
- "sec-fetch-mode": "cors",
- "sec-fetch-site": "cross-site",
- "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
- }
-
params = {
"input": self.location,
"client_id": self.listing_type.value.lower().replace("_", "-"),
@@ -43,9 +30,8 @@ class RealtorScraper(Scraper):
}
response = self.session.get(
- "https://parser-external.geo.moveaws.com/suggest",
+ self.ADDRESS_AUTOCOMPLETE_URL,
params=params,
- headers=headers,
)
response_json = response.json()
@@ -56,6 +42,145 @@ class RealtorScraper(Scraper):
return result[0]
+ def handle_listing(self, listing_id: str) -> list[Property]:
+ query = """query Listing($listing_id: ID!) {
+ listing(id: $listing_id) {
+ source {
+ id
+ listing_id
+ }
+ address {
+ street_number
+ street_name
+ street_suffix
+ unit
+ city
+ state_code
+ postal_code
+ location {
+ coordinate {
+ lat
+ lon
+ }
+ }
+ }
+ basic {
+ sqft
+ beds
+ baths_full
+ baths_half
+ lot_sqft
+ sold_price
+ sold_price
+ type
+ price
+ status
+ sold_date
+ list_date
+ }
+ details {
+ year_built
+ stories
+ garage
+ permalink
+ }
+ }
+ }"""
+
+ variables = {"listing_id": listing_id}
+ payload = {
+ "query": query,
+ "variables": variables,
+ }
+
+ response = self.session.post(self.SEARCH_GQL_URL, json=payload)
+ response_json = response.json()
+
+ property_info = response_json["data"]["listing"]
+
+ mls = (
+ property_info["source"].get("id")
+ if "source" in property_info and isinstance(property_info["source"], dict)
+ else None
+ )
+
+ able_to_get_lat_long = (
+ property_info
+ and property_info.get("address")
+ and property_info["address"].get("location")
+ and property_info["address"]["location"].get("coordinate")
+ )
+
+ listing = Property(
+ mls=mls,
+ mls_id=property_info["source"].get("listing_id")
+ if "source" in property_info and isinstance(property_info["source"], dict)
+ else None,
+ property_url=f"{self.PROPERTY_URL}{property_info['details']['permalink']}",
+ status=property_info["basic"]["status"].upper(),
+ list_price=property_info["basic"]["price"],
+ list_date=property_info["basic"]["list_date"].split("T")[0]
+ if property_info["basic"].get("list_date")
+ else None,
+ prc_sqft=property_info["basic"].get("price") / property_info["basic"].get("sqft")
+ if property_info["basic"].get("price") and property_info["basic"].get("sqft")
+ else None,
+ last_sold_date=property_info["basic"]["sold_date"].split("T")[0]
+ if property_info["basic"].get("sold_date")
+ else None,
+ latitude=property_info["address"]["location"]["coordinate"].get("lat")
+ if able_to_get_lat_long
+ else None,
+ longitude=property_info["address"]["location"]["coordinate"].get("lon")
+ if able_to_get_lat_long
+ else None,
+ address=self._parse_address(property_info, search_type="handle_listing"),
+ description=Description(
+ style=property_info["basic"].get("type", "").upper(),
+ beds=property_info["basic"].get("beds"),
+ baths_full=property_info["basic"].get("baths_full"),
+ baths_half=property_info["basic"].get("baths_half"),
+ sqft=property_info["basic"].get("sqft"),
+ lot_sqft=property_info["basic"].get("lot_sqft"),
+ sold_price=property_info["basic"].get("sold_price"),
+ year_built=property_info["details"].get("year_built"),
+ garage=property_info["details"].get("garage"),
+ stories=property_info["details"].get("stories"),
+ )
+ )
+
+ return [listing]
+
+ def get_latest_listing_id(self, property_id: str) -> str | None:
+ query = """query Property($property_id: ID!) {
+ property(id: $property_id) {
+ listings {
+ listing_id
+ primary
+ }
+ }
+ }
+ """
+
+ variables = {"property_id": property_id}
+ payload = {
+ "query": query,
+ "variables": variables,
+ }
+
+ response = self.session.post(self.SEARCH_GQL_URL, json=payload)
+ response_json = response.json()
+
+ property_info = response_json["data"]["property"]
+ if property_info["listings"] is None:
+ return None
+
+ primary_listing = next((listing for listing in property_info["listings"] if listing["primary"]), None)
+ if primary_listing:
+ return primary_listing["listing_id"]
+ else:
+ return property_info["listings"][0]["listing_id"]
+
def handle_address(self, property_id: str) -> list[Property]:
"""
Handles a specific address & returns one property
@@ -71,22 +196,19 @@ class RealtorScraper(Scraper):
stories
}
address {
- address_validation_code
- city
- country
- county
- line
- postal_code
- state_code
- street_direction
- street_name
street_number
+ street_name
street_suffix
- street_post_direction
- unit_value
unit
- unit_descriptor
- zip
+ city
+ state_code
+ postal_code
+ location {
+ coordinate {
+ lat
+ lon
+ }
+ }
}
basic {
baths
@@ -114,127 +236,175 @@ class RealtorScraper(Scraper):
"variables": variables,
}
- response = self.session.post(self.search_url, json=payload)
+ response = self.session.post(self.SEARCH_GQL_URL, json=payload)
response_json = response.json()
property_info = response_json["data"]["property"]
- address_one, address_two = parse_address_one(property_info["address"]["line"])
return [
Property(
- site_name=self.site_name,
- address=Address(
- address_one=address_one,
- address_two=address_two,
- city=property_info["address"]["city"],
- state=property_info["address"]["state_code"],
- zip_code=property_info["address"]["postal_code"],
- ),
- property_url="https://www.realtor.com/realestateandhomes-detail/"
- + property_info["details"]["permalink"],
- stories=property_info["details"]["stories"],
- year_built=property_info["details"]["year_built"],
- price_per_sqft=property_info["basic"]["price"] // property_info["basic"]["sqft"]
- if property_info["basic"]["sqft"] is not None and property_info["basic"]["price"] is not None
- else None,
mls_id=property_id,
- listing_type=self.listing_type,
- lot_area_value=property_info["public_record"]["lot_size"]
- if property_info["public_record"] is not None
- else None,
- beds_min=property_info["basic"]["beds"],
- beds_max=property_info["basic"]["beds"],
- baths_min=property_info["basic"]["baths"],
- baths_max=property_info["basic"]["baths"],
- sqft_min=property_info["basic"]["sqft"],
- sqft_max=property_info["basic"]["sqft"],
- price_min=property_info["basic"]["price"],
- price_max=property_info["basic"]["price"],
+ property_url=f"{self.PROPERTY_URL}{property_info['details']['permalink']}",
+ address=self._parse_address(
+ property_info, search_type="handle_address"
+ ),
+ description=self._parse_description(property_info),
)
]
- def handle_area(self, variables: dict, return_total: bool = False) -> list[Property] | int:
+ def general_search(
+ self, variables: dict, search_type: str
+ ) -> Dict[str, Union[int, list[Property]]]:
"""
        Handles a location area & returns a dict containing the total result count and a list of properties
"""
- query = (
- """query Home_search(
- $city: String,
- $county: [String],
- $state_code: String,
- $postal_code: String
- $offset: Int,
- ) {
- home_search(
- query: {
- city: $city
- county: $county
- postal_code: $postal_code
- state_code: $state_code
- status: %s
+ results_query = """{
+ count
+ total
+ results {
+ property_id
+ list_date
+ status
+ last_sold_price
+ last_sold_date
+ list_price
+ price_per_sqft
+ description {
+ sqft
+ beds
+ baths_full
+ baths_half
+ lot_sqft
+ sold_price
+ year_built
+ garage
+ type
+ name
+ stories
}
- limit: 200
- offset: $offset
- ) {
- count
- total
- results {
- property_id
- description {
- baths
- beds
- lot_sqft
- sqft
- text
- sold_price
- stories
- year_built
- garage
- unit_number
- floor_number
- }
- location {
- address {
- city
- country
- line
- postal_code
- state_code
- state
- street_direction
- street_name
- street_number
- street_post_direction
- street_suffix
- unit
- coordinate {
- lon
- lat
- }
+ source {
+ id
+ listing_id
+ }
+ hoa {
+ fee
+ }
+ location {
+ address {
+ street_number
+ street_name
+ street_suffix
+ unit
+ city
+ state_code
+ postal_code
+ coordinate {
+ lon
+ lat
}
}
- list_price
- price_per_sqft
- source {
- id
+ neighborhoods {
+ name
}
}
}
- }"""
- % self.listing_type.value.lower()
+ }
+ }"""
+
+ date_param = (
+ 'sold_date: { min: "$today-%sD" }' % self.last_x_days
+ if self.listing_type == ListingType.SOLD and self.last_x_days
+ else (
+ 'list_date: { min: "$today-%sD" }' % self.last_x_days
+ if self.last_x_days
+ else ""
+ )
)
+ sort_param = (
+ "sort: [{ field: sold_date, direction: desc }]"
+ if self.listing_type == ListingType.SOLD
+ else "sort: [{ field: list_date, direction: desc }]"
+ )
+
+ pending_or_contingent_param = "or_filters: { contingent: true, pending: true }" if self.pending_or_contingent else ""
+
+ if search_type == "comps": #: comps search, came from an address
+ query = """query Property_search(
+ $coordinates: [Float]!
+ $radius: String!
+ $offset: Int!,
+ ) {
+ property_search(
+ query: {
+ nearby: {
+ coordinates: $coordinates
+ radius: $radius
+ }
+ status: %s
+ %s
+ }
+ %s
+ limit: 200
+ offset: $offset
+ ) %s""" % (
+ self.listing_type.value.lower(),
+ date_param,
+ sort_param,
+ results_query,
+ )
+ elif search_type == "area": #: general search, came from a general location
+ query = """query Home_search(
+ $city: String,
+ $county: [String],
+ $state_code: String,
+ $postal_code: String
+ $offset: Int,
+ ) {
+ home_search(
+ query: {
+ city: $city
+ county: $county
+ postal_code: $postal_code
+ state_code: $state_code
+ status: %s
+ %s
+ %s
+ }
+ %s
+ limit: 200
+ offset: $offset
+ ) %s""" % (
+ self.listing_type.value.lower(),
+ date_param,
+ pending_or_contingent_param,
+ sort_param,
+ results_query,
+ )
+ else: #: general search, came from an address
+ query = (
+ """query Property_search(
+ $property_id: [ID]!
+ $offset: Int!,
+ ) {
+ property_search(
+ query: {
+ property_id: $property_id
+ }
+ limit: 1
+ offset: $offset
+ ) %s""" % results_query)
+
payload = {
"query": query,
"variables": variables,
}
- response = self.session.post(self.search_url, json=payload)
+ response = self.session.post(self.SEARCH_GQL_URL, json=payload)
response.raise_for_status()
response_json = response.json()
-
- if return_total:
- return response_json["data"]["home_search"]["total"]
+ search_key = "home_search" if search_type == "area" else "property_search"
properties: list[Property] = []
@@ -242,89 +412,164 @@ class RealtorScraper(Scraper):
response_json is None
or "data" not in response_json
or response_json["data"] is None
- or "home_search" not in response_json["data"]
- or response_json["data"]["home_search"] is None
- or "results" not in response_json["data"]["home_search"]
+ or search_key not in response_json["data"]
+ or response_json["data"][search_key] is None
+ or "results" not in response_json["data"][search_key]
):
- return []
+ return {"total": 0, "properties": []}
- for result in response_json["data"]["home_search"]["results"]:
+ for result in response_json["data"][search_key]["results"]:
self.counter += 1
- address_one, _ = parse_address_one(result["location"]["address"]["line"])
+ mls = (
+ result["source"].get("id")
+ if "source" in result and isinstance(result["source"], dict)
+ else None
+ )
+
+ if not mls and self.mls_only:
+ continue
+
+ able_to_get_lat_long = (
+ result
+ and result.get("location")
+ and result["location"].get("address")
+ and result["location"]["address"].get("coordinate")
+ )
+
realty_property = Property(
- address=Address(
- address_one=address_one,
- city=result["location"]["address"]["city"],
- state=result["location"]["address"]["state_code"],
- zip_code=result["location"]["address"]["postal_code"],
- address_two=parse_address_two(result["location"]["address"]["unit"]),
- ),
- latitude=result["location"]["address"]["coordinate"]["lat"]
- if result
- and result.get("location")
- and result["location"].get("address")
- and result["location"]["address"].get("coordinate")
- and "lat" in result["location"]["address"]["coordinate"]
+ mls=mls,
+ mls_id=result["source"].get("listing_id")
+ if "source" in result and isinstance(result["source"], dict)
else None,
- longitude=result["location"]["address"]["coordinate"]["lon"]
- if result
- and result.get("location")
- and result["location"].get("address")
- and result["location"]["address"].get("coordinate")
- and "lon" in result["location"]["address"]["coordinate"]
+ property_url=f"{self.PROPERTY_URL}{result['property_id']}",
+ status=result["status"].upper(),
+ list_price=result["list_price"],
+ list_date=result["list_date"].split("T")[0]
+ if result.get("list_date")
else None,
- site_name=self.site_name,
- property_url="https://www.realtor.com/realestateandhomes-detail/" + result["property_id"],
- stories=result["description"]["stories"],
- year_built=result["description"]["year_built"],
- price_per_sqft=result["price_per_sqft"],
- mls_id=result["property_id"],
- listing_type=self.listing_type,
- lot_area_value=result["description"]["lot_sqft"],
- beds_min=result["description"]["beds"],
- beds_max=result["description"]["beds"],
- baths_min=result["description"]["baths"],
- baths_max=result["description"]["baths"],
- sqft_min=result["description"]["sqft"],
- sqft_max=result["description"]["sqft"],
- price_min=result["list_price"],
- price_max=result["list_price"],
+ prc_sqft=result.get("price_per_sqft"),
+ last_sold_date=result.get("last_sold_date"),
+ hoa_fee=result["hoa"]["fee"]
+ if result.get("hoa") and isinstance(result["hoa"], dict)
+ else None,
+ latitude=result["location"]["address"]["coordinate"].get("lat")
+ if able_to_get_lat_long
+ else None,
+ longitude=result["location"]["address"]["coordinate"].get("lon")
+ if able_to_get_lat_long
+ else None,
+ address=self._parse_address(result, search_type="general_search"),
+ #: neighborhoods=self._parse_neighborhoods(result),
+ description=self._parse_description(result),
)
properties.append(realty_property)
- return properties
+ return {
+ "total": response_json["data"][search_key]["total"],
+ "properties": properties,
+ }
def search(self):
location_info = self.handle_location()
location_type = location_info["area_type"]
- if location_type == "address":
- property_id = location_info["mpr_id"]
- return self.handle_address(property_id)
-
- offset = 0
search_variables = {
- "city": location_info.get("city"),
- "county": location_info.get("county"),
- "state_code": location_info.get("state_code"),
- "postal_code": location_info.get("postal_code"),
- "offset": offset,
+ "offset": 0,
}
- total = self.handle_area(search_variables, return_total=True)
+        search_type = (
+            "comps"
+            if self.radius and location_type == "address"
+            else "address"
+            if location_type == "address" and not self.radius
+            else "area"
+        )
+ if location_type == "address":
+ if not self.radius: #: single address search, non comps
+ property_id = location_info["mpr_id"]
+ search_variables |= {"property_id": property_id}
+
+ gql_results = self.general_search(search_variables, search_type=search_type)
+ if gql_results["total"] == 0:
+ listing_id = self.get_latest_listing_id(property_id)
+ if listing_id is None:
+ return self.handle_address(property_id)
+ else:
+ return self.handle_listing(listing_id)
+ else:
+ return gql_results["properties"]
+
+ else: #: general search, comps (radius)
+ coordinates = list(location_info["centroid"].values())
+ search_variables |= {
+ "coordinates": coordinates,
+ "radius": "{}mi".format(self.radius),
+ }
+
+ else: #: general search, location
+ search_variables |= {
+ "city": location_info.get("city"),
+ "county": location_info.get("county"),
+ "state_code": location_info.get("state_code"),
+ "postal_code": location_info.get("postal_code"),
+ }
+
+ result = self.general_search(search_variables, search_type=search_type)
+ total = result["total"]
+ homes = result["properties"]
- homes = []
with ThreadPoolExecutor(max_workers=10) as executor:
futures = [
executor.submit(
- self.handle_area,
+ self.general_search,
variables=search_variables | {"offset": i},
- return_total=False,
+ search_type=search_type,
)
- for i in range(0, total, 200)
+ for i in range(200, min(total, 10000), 200)
]
for future in as_completed(futures):
- homes.extend(future.result())
+ homes.extend(future.result()["properties"])
return homes
+
+ @staticmethod
+ def _parse_neighborhoods(result: dict) -> Optional[str]:
+ neighborhoods_list = []
+ neighborhoods = result["location"].get("neighborhoods", [])
+
+ if neighborhoods:
+ for neighborhood in neighborhoods:
+ name = neighborhood.get("name")
+ if name:
+ neighborhoods_list.append(name)
+
+ return ", ".join(neighborhoods_list) if neighborhoods_list else None
+
+ @staticmethod
+ def _parse_address(result: dict, search_type):
+ if search_type == "general_search":
+ return Address(
+ street=f"{result['location']['address']['street_number']} {result['location']['address']['street_name']} {result['location']['address']['street_suffix']}",
+ unit=result["location"]["address"]["unit"],
+ city=result["location"]["address"]["city"],
+ state=result["location"]["address"]["state_code"],
+ zip=result["location"]["address"]["postal_code"],
+ )
+ return Address(
+ street=f"{result['address']['street_number']} {result['address']['street_name']} {result['address']['street_suffix']}",
+ unit=result["address"]["unit"],
+ city=result["address"]["city"],
+ state=result["address"]["state_code"],
+ zip=result["address"]["postal_code"],
+ )
+
+ @staticmethod
+ def _parse_description(result: dict) -> Description:
+ description_data = result.get("description", {})
+ return Description(
+ style=description_data.get("type", "").upper(),
+ beds=description_data.get("beds"),
+ baths_full=description_data.get("baths_full"),
+ baths_half=description_data.get("baths_half"),
+ sqft=description_data.get("sqft"),
+ lot_sqft=description_data.get("lot_sqft"),
+ sold_price=description_data.get("sold_price"),
+ year_built=description_data.get("year_built"),
+ garage=description_data.get("garage"),
+ stories=description_data.get("stories"),
+ )
diff --git a/homeharvest/core/scrapers/redfin/__init__.py b/homeharvest/core/scrapers/redfin/__init__.py
deleted file mode 100644
index 80b91f8..0000000
--- a/homeharvest/core/scrapers/redfin/__init__.py
+++ /dev/null
@@ -1,246 +0,0 @@
-"""
-homeharvest.redfin.__init__
-~~~~~~~~~~~~
-
-This module implements the scraper for redfin.com
-"""
-import json
-from typing import Any
-from .. import Scraper
-from ....utils import parse_address_two, parse_address_one
-from ..models import Property, Address, PropertyType, ListingType, SiteName, Agent
-from ....exceptions import NoResultsFound, SearchTooBroad
-from datetime import datetime
-
-
-class RedfinScraper(Scraper):
- def __init__(self, scraper_input):
- super().__init__(scraper_input)
- self.listing_type = scraper_input.listing_type
-
- def _handle_location(self):
- url = "https://www.redfin.com/stingray/do/location-autocomplete?v=2&al=1&location={}".format(self.location)
-
- response = self.session.get(url)
- response_json = json.loads(response.text.replace("{}&&", ""))
-
- def get_region_type(match_type: str):
- if match_type == "4":
- return "2" #: zip
- elif match_type == "2":
- return "6" #: city
- elif match_type == "1":
- return "address" #: address, needs to be handled differently
- elif match_type == "11":
- return "state"
-
- if "exactMatch" not in response_json["payload"]:
- raise NoResultsFound("No results found for location: {}".format(self.location))
-
- if response_json["payload"]["exactMatch"] is not None:
- target = response_json["payload"]["exactMatch"]
- else:
- target = response_json["payload"]["sections"][0]["rows"][0]
-
- return target["id"].split("_")[1], get_region_type(target["type"])
-
- def _parse_home(self, home: dict, single_search: bool = False) -> Property:
- def get_value(key: str) -> Any | None:
- if key in home and "value" in home[key]:
- return home[key]["value"]
-
- if not single_search:
- address = Address(
- address_one=parse_address_one(get_value("streetLine"))[0],
- address_two=parse_address_one(get_value("streetLine"))[1],
- city=home.get("city"),
- state=home.get("state"),
- zip_code=home.get("zip"),
- )
- else:
- address_info = home.get("streetAddress")
- address_one, address_two = parse_address_one(address_info.get("assembledAddress"))
-
- address = Address(
- address_one=address_one,
- address_two=address_two,
- city=home.get("city"),
- state=home.get("state"),
- zip_code=home.get("zip"),
- )
-
- url = "https://www.redfin.com{}".format(home["url"])
- lot_size_data = home.get("lotSize")
-
- if not isinstance(lot_size_data, int):
- lot_size = lot_size_data.get("value", None) if isinstance(lot_size_data, dict) else None
- else:
- lot_size = lot_size_data
-
- lat_long = get_value("latLong")
-
- return Property(
- site_name=self.site_name,
- listing_type=self.listing_type,
- address=address,
- property_url=url,
- beds_min=home["beds"] if "beds" in home else None,
- beds_max=home["beds"] if "beds" in home else None,
- baths_min=home["baths"] if "baths" in home else None,
- baths_max=home["baths"] if "baths" in home else None,
- price_min=get_value("price"),
- price_max=get_value("price"),
- sqft_min=get_value("sqFt"),
- sqft_max=get_value("sqFt"),
- stories=home["stories"] if "stories" in home else None,
- agent=Agent( #: listingAgent, some have sellingAgent as well
- name=home['listingAgent'].get('name') if 'listingAgent' in home else None,
- phone=home['listingAgent'].get('phone') if 'listingAgent' in home else None,
- ),
- description=home["listingRemarks"] if "listingRemarks" in home else None,
- year_built=get_value("yearBuilt") if not single_search else home.get("yearBuilt"),
- lot_area_value=lot_size,
- property_type=PropertyType.from_int_code(home.get("propertyType")),
- price_per_sqft=get_value("pricePerSqFt") if type(home.get("pricePerSqFt")) != int else home.get("pricePerSqFt"),
- mls_id=get_value("mlsId"),
- latitude=lat_long.get('latitude') if lat_long else None,
- longitude=lat_long.get('longitude') if lat_long else None,
- sold_date=datetime.fromtimestamp(home['soldDate'] / 1000) if 'soldDate' in home else None,
- days_on_market=get_value("dom")
- )
-
- def _handle_rentals(self, region_id, region_type):
-        url = f"https://www.redfin.com/stingray/api/v1/search/rentals?al=1&isRentals=true&region_id={region_id}&region_type={region_type}&num_homes=100000"
-
- response = self.session.get(url)
- response.raise_for_status()
- homes = response.json()
-
- properties_list = []
-
- for home in homes["homes"]:
- home_data = home["homeData"]
- rental_data = home["rentalExtension"]
-
- property_url = f"https://www.redfin.com{home_data.get('url', '')}"
- address_info = home_data.get("addressInfo", {})
- centroid = address_info.get("centroid", {}).get("centroid", {})
- address = Address(
- address_one=parse_address_one(address_info.get("formattedStreetLine"))[0],
- city=address_info.get("city"),
- state=address_info.get("state"),
- zip_code=address_info.get("zip"),
- )
-
- price_range = rental_data.get("rentPriceRange", {"min": None, "max": None})
- bed_range = rental_data.get("bedRange", {"min": None, "max": None})
- bath_range = rental_data.get("bathRange", {"min": None, "max": None})
- sqft_range = rental_data.get("sqftRange", {"min": None, "max": None})
-
- property_ = Property(
- property_url=property_url,
- site_name=SiteName.REDFIN,
- listing_type=ListingType.FOR_RENT,
- address=address,
- description=rental_data.get("description"),
- latitude=centroid.get("latitude"),
- longitude=centroid.get("longitude"),
- baths_min=bath_range.get("min"),
- baths_max=bath_range.get("max"),
- beds_min=bed_range.get("min"),
- beds_max=bed_range.get("max"),
- price_min=price_range.get("min"),
- price_max=price_range.get("max"),
- sqft_min=sqft_range.get("min"),
- sqft_max=sqft_range.get("max"),
- img_src=home_data.get("staticMapUrl"),
- posted_time=rental_data.get("lastUpdated"),
- bldg_name=rental_data.get("propertyName"),
- )
-
- properties_list.append(property_)
-
- if not properties_list:
- raise NoResultsFound("No rentals found for the given location.")
-
- return properties_list
-
- def _parse_building(self, building: dict) -> Property:
- street_address = " ".join(
- [
- building["address"]["streetNumber"],
- building["address"]["directionalPrefix"],
- building["address"]["streetName"],
- building["address"]["streetType"],
- ]
- )
- return Property(
- site_name=self.site_name,
- property_type=PropertyType("BUILDING"),
- address=Address(
- address_one=parse_address_one(street_address)[0],
- city=building["address"]["city"],
- state=building["address"]["stateOrProvinceCode"],
- zip_code=building["address"]["postalCode"],
- address_two=parse_address_two(
- " ".join(
- [
- building["address"]["unitType"],
- building["address"]["unitValue"],
- ]
- )
- ),
- ),
- property_url="https://www.redfin.com{}".format(building["url"]),
- listing_type=self.listing_type,
- unit_count=building.get("numUnitsForSale"),
- )
-
- def handle_address(self, home_id: str):
- """
- EPs:
- https://www.redfin.com/stingray/api/home/details/initialInfo?al=1&path=/TX/Austin/70-Rainey-St-78701/unit-1608/home/147337694
- https://www.redfin.com/stingray/api/home/details/mainHouseInfoPanelInfo?propertyId=147337694&accessLevel=3
- https://www.redfin.com/stingray/api/home/details/aboveTheFold?propertyId=147337694&accessLevel=3
- https://www.redfin.com/stingray/api/home/details/belowTheFold?propertyId=147337694&accessLevel=3
- """
- url = "https://www.redfin.com/stingray/api/home/details/aboveTheFold?propertyId={}&accessLevel=3".format(
- home_id
- )
-
- response = self.session.get(url)
- response_json = json.loads(response.text.replace("{}&&", ""))
-
- parsed_home = self._parse_home(response_json["payload"]["addressSectionInfo"], single_search=True)
- return [parsed_home]
-
- def search(self):
- region_id, region_type = self._handle_location()
-
- if region_type == "state":
- raise SearchTooBroad("State searches are not supported, please use a more specific location.")
-
- if region_type == "address":
- home_id = region_id
- return self.handle_address(home_id)
-
- if self.listing_type == ListingType.FOR_RENT:
- return self._handle_rentals(region_id, region_type)
- else:
- if self.listing_type == ListingType.FOR_SALE:
-                url = f"https://www.redfin.com/stingray/api/gis?al=1&region_id={region_id}&region_type={region_type}&num_homes=100000"
-            else:
-                url = f"https://www.redfin.com/stingray/api/gis?al=1&region_id={region_id}&region_type={region_type}&sold_within_days=30&num_homes=100000"
- response = self.session.get(url)
- response_json = json.loads(response.text.replace("{}&&", ""))
-
- if "payload" in response_json:
- homes_list = response_json["payload"].get("homes", [])
- buildings_list = response_json["payload"].get("buildings", {}).values()
-
- homes = [self._parse_home(home) for home in homes_list] + [
- self._parse_building(building) for building in buildings_list
- ]
- return homes
- else:
- return []
diff --git a/homeharvest/core/scrapers/zillow/__init__.py b/homeharvest/core/scrapers/zillow/__init__.py
deleted file mode 100644
index ba55a01..0000000
--- a/homeharvest/core/scrapers/zillow/__init__.py
+++ /dev/null
@@ -1,335 +0,0 @@
-"""
-homeharvest.zillow.__init__
-~~~~~~~~~~~~
-
-This module implements the scraper for zillow.com
-"""
-import re
-import json
-
-import tls_client
-
-from .. import Scraper
-from requests.exceptions import HTTPError
-from ....utils import parse_address_one, parse_address_two
-from ....exceptions import GeoCoordsNotFound, NoResultsFound
-from ..models import Property, Address, ListingType, PropertyType, Agent
-import urllib.parse
-from datetime import datetime, timedelta
-
-
-class ZillowScraper(Scraper):
- def __init__(self, scraper_input):
- session = tls_client.Session(
- client_identifier="chrome112", random_tls_extension_order=True
- )
-
- super().__init__(scraper_input, session)
-
- self.session.headers.update({
- 'authority': 'www.zillow.com',
- 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
- 'accept-language': 'en-US,en;q=0.9',
- 'cache-control': 'max-age=0',
- 'sec-fetch-dest': 'document',
- 'sec-fetch-mode': 'navigate',
- 'sec-fetch-site': 'same-origin',
- 'sec-fetch-user': '?1',
- 'upgrade-insecure-requests': '1',
- 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
- })
-
- if not self.is_plausible_location(self.location):
- raise NoResultsFound("Invalid location input: {}".format(self.location))
-
- listing_type_to_url_path = {
- ListingType.FOR_SALE: "for_sale",
- ListingType.FOR_RENT: "for_rent",
- ListingType.SOLD: "recently_sold",
- }
-
- self.url = f"https://www.zillow.com/homes/{listing_type_to_url_path[self.listing_type]}/{self.location}_rb/"
-
- def is_plausible_location(self, location: str) -> bool:
- url = (
- "https://www.zillowstatic.com/autocomplete/v3/suggestions?q={"
- "}&abKey=6666272a-4b99-474c-b857-110ec438732b&clientId=homepage-render"
- ).format(urllib.parse.quote(location))
-
- resp = self.session.get(url)
-
- return resp.json()["results"] != []
-
- def search(self):
- resp = self.session.get(self.url)
- if resp.status_code != 200:
- raise HTTPError(
- f"bad response status code: {resp.status_code}"
- )
- content = resp.text
-
- match = re.search(
-            r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
- content,
- re.DOTALL,
- )
- if not match:
- raise NoResultsFound("No results were found for Zillow with the given Location.")
-
- json_str = match.group(1)
- data = json.loads(json_str)
-
- if "searchPageState" in data["props"]["pageProps"]:
- pattern = r'window\.mapBounds = \{\s*"west":\s*(-?\d+\.\d+),\s*"east":\s*(-?\d+\.\d+),\s*"south":\s*(-?\d+\.\d+),\s*"north":\s*(-?\d+\.\d+)\s*\};'
-
- match = re.search(pattern, content)
-
- if match:
- coords = [float(coord) for coord in match.groups()]
- return self._fetch_properties_backend(coords)
-
- else:
- raise GeoCoordsNotFound("Box bounds could not be located.")
-
- elif "gdpClientCache" in data["props"]["pageProps"]:
- gdp_client_cache = json.loads(data["props"]["pageProps"]["gdpClientCache"])
- main_key = list(gdp_client_cache.keys())[0]
-
- property_data = gdp_client_cache[main_key]["property"]
- property = self._get_single_property_page(property_data)
-
- return [property]
- raise NoResultsFound("Specific property data not found in the response.")
-
- def _fetch_properties_backend(self, coords):
- url = "https://www.zillow.com/async-create-search-page-state"
-
- filter_state_for_sale = {
- "sortSelection": {
- # "value": "globalrelevanceex"
- "value": "days"
- },
- "isAllHomes": {"value": True},
- }
-
- filter_state_for_rent = {
- "isForRent": {"value": True},
- "isForSaleByAgent": {"value": False},
- "isForSaleByOwner": {"value": False},
- "isNewConstruction": {"value": False},
- "isComingSoon": {"value": False},
- "isAuction": {"value": False},
- "isForSaleForeclosure": {"value": False},
- "isAllHomes": {"value": True},
- }
-
- filter_state_sold = {
- "isRecentlySold": {"value": True},
- "isForSaleByAgent": {"value": False},
- "isForSaleByOwner": {"value": False},
- "isNewConstruction": {"value": False},
- "isComingSoon": {"value": False},
- "isAuction": {"value": False},
- "isForSaleForeclosure": {"value": False},
- "isAllHomes": {"value": True},
- }
-
- selected_filter = (
- filter_state_for_rent
- if self.listing_type == ListingType.FOR_RENT
- else filter_state_for_sale
- if self.listing_type == ListingType.FOR_SALE
- else filter_state_sold
- )
-
- payload = {
- "searchQueryState": {
- "pagination": {},
- "isMapVisible": True,
- "mapBounds": {
- "west": coords[0],
- "east": coords[1],
- "south": coords[2],
- "north": coords[3],
- },
- "filterState": selected_filter,
- "isListVisible": True,
- "mapZoom": 11,
- },
- "wants": {"cat1": ["mapResults"]},
- "isDebugRequest": False,
- }
- resp = self.session.put(url, json=payload)
- if resp.status_code != 200:
- raise HTTPError(
- f"bad response status code: {resp.status_code}"
- )
- return self._parse_properties(resp.json())
-
- @staticmethod
- def parse_posted_time(time: str) -> datetime:
- int_time = int(time.split(" ")[0])
-
- if "hour" in time:
- return datetime.now() - timedelta(hours=int_time)
-
- if "day" in time:
- return datetime.now() - timedelta(days=int_time)
-
- def _parse_properties(self, property_data: dict):
- mapresults = property_data["cat1"]["searchResults"]["mapResults"]
-
- properties_list = []
-
- for result in mapresults:
- if "hdpData" in result:
- home_info = result["hdpData"]["homeInfo"]
- address_data = {
- "address_one": parse_address_one(home_info.get("streetAddress"))[0],
- "address_two": parse_address_two(home_info["unit"]) if "unit" in home_info else "#",
- "city": home_info.get("city"),
- "state": home_info.get("state"),
- "zip_code": home_info.get("zipcode"),
- }
- property_obj = Property(
- site_name=self.site_name,
- address=Address(**address_data),
- property_url=f"https://www.zillow.com{result['detailUrl']}",
- tax_assessed_value=int(home_info["taxAssessedValue"]) if "taxAssessedValue" in home_info else None,
- property_type=PropertyType(home_info.get("homeType")),
- listing_type=ListingType(
- home_info["statusType"] if "statusType" in home_info else self.listing_type
- ),
- status_text=result.get("statusText"),
- posted_time=self.parse_posted_time(result["variableData"]["text"])
- if "variableData" in result
- and "text" in result["variableData"]
- and result["variableData"]["type"] == "TIME_ON_INFO"
- else None,
- price_min=home_info.get("price"),
- price_max=home_info.get("price"),
- beds_min=int(home_info["bedrooms"]) if "bedrooms" in home_info else None,
- beds_max=int(home_info["bedrooms"]) if "bedrooms" in home_info else None,
- baths_min=home_info.get("bathrooms"),
- baths_max=home_info.get("bathrooms"),
- sqft_min=int(home_info["livingArea"]) if "livingArea" in home_info else None,
- sqft_max=int(home_info["livingArea"]) if "livingArea" in home_info else None,
- price_per_sqft=int(home_info["price"] // home_info["livingArea"])
- if "livingArea" in home_info and home_info["livingArea"] != 0 and "price" in home_info
- else None,
- latitude=result["latLong"]["latitude"],
- longitude=result["latLong"]["longitude"],
- lot_area_value=round(home_info["lotAreaValue"], 2) if "lotAreaValue" in home_info else None,
- lot_area_unit=home_info.get("lotAreaUnit"),
- img_src=result.get("imgSrc"),
- )
-
- properties_list.append(property_obj)
-
- elif "isBuilding" in result:
- price_string = result["price"].replace("$", "").replace(",", "").replace("+/mo", "")
-
- match = re.search(r"(\d+)", price_string)
- price_value = int(match.group(1)) if match else None
- building_obj = Property(
- property_url=f"https://www.zillow.com{result['detailUrl']}",
- site_name=self.site_name,
- property_type=PropertyType("BUILDING"),
- listing_type=ListingType(result["statusType"]),
- img_src=result.get("imgSrc"),
- address=self._extract_address(result["address"]),
- baths_min=result.get("minBaths"),
- area_min=result.get("minArea"),
- bldg_name=result.get("communityName"),
- status_text=result.get("statusText"),
- price_min=price_value if "+/mo" in result.get("price") else None,
- price_max=price_value if "+/mo" in result.get("price") else None,
- latitude=result.get("latLong", {}).get("latitude"),
- longitude=result.get("latLong", {}).get("longitude"),
- unit_count=result.get("unitCount"),
- )
-
- properties_list.append(building_obj)
-
- return properties_list
-
- def _get_single_property_page(self, property_data: dict):
- """
- This method is used when a user enters the exact location & zillow returns just one property
- """
- url = (
- f"https://www.zillow.com{property_data['hdpUrl']}"
- if "zillow.com" not in property_data["hdpUrl"]
- else property_data["hdpUrl"]
- )
- address_data = property_data["address"]
- address_one, address_two = parse_address_one(address_data["streetAddress"])
- address = Address(
- address_one=address_one,
- address_two=address_two if address_two else "#",
- city=address_data["city"],
- state=address_data["state"],
- zip_code=address_data["zipcode"],
- )
- property_type = property_data.get("homeType", None)
- return Property(
- site_name=self.site_name,
- property_url=url,
- property_type=PropertyType(property_type) if property_type in PropertyType.__members__ else None,
- listing_type=self.listing_type,
- address=address,
- year_built=property_data.get("yearBuilt"),
- tax_assessed_value=property_data.get("taxAssessedValue"),
- lot_area_value=property_data.get("lotAreaValue"),
- lot_area_unit=property_data["lotAreaUnits"].lower() if "lotAreaUnits" in property_data else None,
- agent=Agent(
- name=property_data.get("attributionInfo", {}).get("agentName")
- ),
- stories=property_data.get("resoFacts", {}).get("stories"),
- mls_id=property_data.get("attributionInfo", {}).get("mlsId"),
- beds_min=property_data.get("bedrooms"),
- beds_max=property_data.get("bedrooms"),
- baths_min=property_data.get("bathrooms"),
- baths_max=property_data.get("bathrooms"),
- price_min=property_data.get("price"),
- price_max=property_data.get("price"),
- sqft_min=property_data.get("livingArea"),
- sqft_max=property_data.get("livingArea"),
- price_per_sqft=property_data.get("resoFacts", {}).get("pricePerSquareFoot"),
- latitude=property_data.get("latitude"),
- longitude=property_data.get("longitude"),
- img_src=property_data.get("streetViewTileImageUrlMediumAddress"),
- description=property_data.get("description"),
- )
-
- def _extract_address(self, address_str):
- """
- Extract address components from a string formatted like '555 Wedglea Dr, Dallas, TX',
- and return an Address object.
- """
- parts = address_str.split(", ")
-
- if len(parts) != 3:
- raise ValueError(f"Unexpected address format: {address_str}")
-
- address_one = parts[0].strip()
- city = parts[1].strip()
- state_zip = parts[2].split(" ")
-
- if len(state_zip) == 1:
- state = state_zip[0].strip()
- zip_code = None
- elif len(state_zip) == 2:
- state = state_zip[0].strip()
- zip_code = state_zip[1].strip()
- else:
- raise ValueError(f"Unexpected state/zip format in address: {address_str}")
-
- address_one, address_two = parse_address_one(address_one)
- return Address(
- address_one=address_one,
- address_two=address_two if address_two else "#",
- city=city,
- state=state,
- zip_code=zip_code,
- )
diff --git a/homeharvest/exceptions.py b/homeharvest/exceptions.py
index 95eedbc..f018c97 100644
--- a/homeharvest/exceptions.py
+++ b/homeharvest/exceptions.py
@@ -1,18 +1,6 @@
-class InvalidSite(Exception):
- """Raised when a provided site is does not exist."""
-
-
class InvalidListingType(Exception):
    """Raised when the provided listing type does not exist."""


class NoResultsFound(Exception):
    """Raised when no results are found for the given location."""
-
-
-class GeoCoordsNotFound(Exception):
- """Raised when no property is found for the given address"""
-
-
-class SearchTooBroad(Exception):
- """Raised when the search is too broad"""
diff --git a/homeharvest/utils.py b/homeharvest/utils.py
index 2aeedee..5d125e1 100644
--- a/homeharvest/utils.py
+++ b/homeharvest/utils.py
@@ -1,38 +1,71 @@
-import re
+from .core.scrapers.models import Property, ListingType
+import pandas as pd
+from .exceptions import InvalidListingType
+
+ordered_properties = [
+ "property_url",
+ "mls",
+ "mls_id",
+ "status",
+ "style",
+ "street",
+ "unit",
+ "city",
+ "state",
+ "zip_code",
+ "beds",
+ "full_baths",
+ "half_baths",
+ "sqft",
+ "year_built",
+ "list_price",
+ "list_date",
+ "sold_price",
+ "last_sold_date",
+ "lot_sqft",
+ "price_per_sqft",
+ "latitude",
+ "longitude",
+ "stories",
+ "hoa_fee",
+ "parking_garage",
+]
-def parse_address_one(street_address: str) -> tuple:
- if not street_address:
- return street_address, "#"
+
+
+def process_result(result: Property) -> pd.DataFrame:
+ prop_data = {prop: None for prop in ordered_properties}
+ prop_data.update(result.__dict__)
- apt_match = re.search(
- r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+|SUITE\s*[\dA-Z]+)$",
- street_address,
- re.I,
- )
+ if "address" in prop_data:
+ address_data = prop_data["address"]
+ prop_data["street"] = address_data.street
+ prop_data["unit"] = address_data.unit
+ prop_data["city"] = address_data.city
+ prop_data["state"] = address_data.state
+ prop_data["zip_code"] = address_data.zip
- if apt_match:
- apt_str = apt_match.group().strip()
- cleaned_apt_str = re.sub(r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", "#", apt_str, flags=re.I)
+ prop_data["price_per_sqft"] = prop_data["prc_sqft"]
- main_address = street_address.replace(apt_str, "").strip()
- return main_address, cleaned_apt_str
- else:
- return street_address, "#"
+ description = result.description
+ prop_data["style"] = description.style
+ prop_data["beds"] = description.beds
+ prop_data["full_baths"] = description.baths_full
+ prop_data["half_baths"] = description.baths_half
+ prop_data["sqft"] = description.sqft
+ prop_data["lot_sqft"] = description.lot_sqft
+ prop_data["sold_price"] = description.sold_price
+ prop_data["year_built"] = description.year_built
+ prop_data["parking_garage"] = description.garage
+ prop_data["stories"] = description.stories
+
+    # Build a single-row DataFrame and enforce a consistent column order.
+    properties_df = pd.DataFrame([prop_data])
+
+    return properties_df.reindex(columns=ordered_properties)
-def parse_address_two(street_address: str):
- if not street_address:
- return "#"
- apt_match = re.search(
- r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+|SUITE\s*[\dA-Z]+)$",
- street_address,
- re.I,
- )
-
- if apt_match:
- apt_str = apt_match.group().strip()
- apt_str = re.sub(r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", "#", apt_str, flags=re.I)
- return apt_str
- else:
- return "#"
+
+
+def validate_input(listing_type: str) -> None:
+ if listing_type.upper() not in ListingType.__members__:
+ raise InvalidListingType(
+ f"Provided listing type, '{listing_type}', does not exist."
+ )
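The new `process_result` helper above follows a pre-seed-then-overlay pattern: it starts from a dict with every expected column set to `None`, overlays the scraped fields, and flattens the nested address into top-level columns. A standalone sketch of that pattern, using hypothetical simplified `Property`/`Address` dataclasses in place of the real homeharvest models:

```python
import pandas as pd
from dataclasses import dataclass, field

# Hypothetical, simplified stand-ins for homeharvest's models,
# used only to illustrate the flattening pattern.
@dataclass
class Address:
    street: str = None
    city: str = None
    state: str = None
    zip: str = None

@dataclass
class Property:
    property_url: str = None
    list_price: int = None
    address: Address = field(default_factory=Address)

# Columns the output DataFrame should always have, in order.
ordered_properties = ["property_url", "street", "city", "state", "zip_code", "list_price"]

def process_result(result: Property) -> pd.DataFrame:
    # Pre-seed every expected column with None so missing fields
    # still appear in the output, then overlay the scraped values.
    prop_data = {prop: None for prop in ordered_properties}
    prop_data.update(result.__dict__)

    # Flatten the nested Address object into top-level columns.
    address = prop_data.pop("address")
    prop_data["street"] = address.street
    prop_data["city"] = address.city
    prop_data["state"] = address.state
    prop_data["zip_code"] = address.zip

    return pd.DataFrame([prop_data]).reindex(columns=ordered_properties)

row = process_result(Property(
    property_url="https://www.realtor.com/example",
    list_price=450000,
    address=Address(street="2530 Al Lipscomb Way", city="Dallas", state="TX", zip="75215"),
))
print(row.columns.tolist())
# ['property_url', 'street', 'city', 'state', 'zip_code', 'list_price']
```

Pre-seeding the dict guarantees a stable column set in the exported CSV/Excel file even when a listing is missing some fields.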
diff --git a/pyproject.toml b/pyproject.toml
index 5724747..c79f04e 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "homeharvest"
-version = "0.2.19"
+version = "0.3.0"
-description = "Real estate scraping library supporting Zillow, Realtor.com & Redfin."
+description = "Real estate scraping library for fetching listings from Realtor.com."
authors = ["Zachary Hampton ", "Cullen Watson "]
homepage = "https://github.com/ZacharyHampton/HomeHarvest"
diff --git a/tests/test_realtor.py b/tests/test_realtor.py
index 3b23529..a498112 100644
--- a/tests/test_realtor.py
+++ b/tests/test_realtor.py
@@ -1,26 +1,78 @@
from homeharvest import scrape_property
from homeharvest.exceptions import (
- InvalidSite,
InvalidListingType,
NoResultsFound,
- GeoCoordsNotFound,
)
+
+
+def test_realtor_pending_or_contingent():
+ pending_or_contingent_result = scrape_property(
+ location="Surprise, AZ",
+ pending_or_contingent=True,
+ )
+
+ regular_result = scrape_property(
+ location="Surprise, AZ",
+ pending_or_contingent=False,
+ )
+
+ assert all([result is not None for result in [pending_or_contingent_result, regular_result]])
+ assert len(pending_or_contingent_result) != len(regular_result)
+
+
+def test_realtor_comps():
+ result = scrape_property(
+ location="2530 Al Lipscomb Way",
+ radius=0.5,
+ property_younger_than=180,
+ listing_type="sold",
+ )
+
+ assert result is not None and len(result) > 0
+
+
+def test_realtor_last_x_days_sold():
+ days_result_30 = scrape_property(
+ location="Dallas, TX", listing_type="sold", property_younger_than=30
+ )
+
+ days_result_10 = scrape_property(
+ location="Dallas, TX", listing_type="sold", property_younger_than=10
+ )
+
+ assert all(
+ [result is not None for result in [days_result_30, days_result_10]]
+ ) and len(days_result_30) != len(days_result_10)
+
+
+def test_realtor_single_property():
+ results = [
+ scrape_property(
+ location="15509 N 172nd Dr, Surprise, AZ 85388",
+ listing_type="for_sale",
+ ),
+ scrape_property(
+ location="2530 Al Lipscomb Way",
+ listing_type="for_sale",
+ ),
+ ]
+
+ assert all([result is not None for result in results])
+
+
def test_realtor():
results = [
scrape_property(
location="2530 Al Lipscomb Way",
- site_name="realtor.com",
listing_type="for_sale",
),
scrape_property(
- location="Phoenix, AZ", site_name=["realtor.com"], listing_type="for_rent"
+ location="Phoenix, AZ", listing_type="for_rent"
), #: does not support "city, state, USA" format
scrape_property(
- location="Dallas, TX", site_name="realtor.com", listing_type="sold"
+ location="Dallas, TX", listing_type="sold"
), #: does not support "city, state, USA" format
- scrape_property(location="85281", site_name="realtor.com"),
+ scrape_property(location="85281"),
]
assert all([result is not None for result in results])
@@ -30,11 +82,10 @@ def test_realtor():
bad_results += [
scrape_property(
location="abceefg ju098ot498hh9",
- site_name="realtor.com",
listing_type="for_sale",
)
]
- except (InvalidSite, InvalidListingType, NoResultsFound, GeoCoordsNotFound):
+ except (InvalidListingType, NoResultsFound):
assert True
assert all([result is None for result in bad_results])
diff --git a/tests/test_redfin.py b/tests/test_redfin.py
deleted file mode 100644
index 6904499..0000000
--- a/tests/test_redfin.py
+++ /dev/null
@@ -1,35 +0,0 @@
-from homeharvest import scrape_property
-from homeharvest.exceptions import (
- InvalidSite,
- InvalidListingType,
- NoResultsFound,
- GeoCoordsNotFound,
- SearchTooBroad,
-)
-
-
-def test_redfin():
- results = [
- scrape_property(location="San Diego", site_name="redfin", listing_type="for_sale"),
- scrape_property(location="2530 Al Lipscomb Way", site_name="redfin", listing_type="for_sale"),
- scrape_property(location="Phoenix, AZ, USA", site_name=["redfin"], listing_type="for_rent"),
- scrape_property(location="Dallas, TX, USA", site_name="redfin", listing_type="sold"),
- scrape_property(location="85281", site_name="redfin"),
- ]
-
- assert all([result is not None for result in results])
-
- bad_results = []
- try:
- bad_results += [
- scrape_property(
- location="abceefg ju098ot498hh9",
- site_name="redfin",
- listing_type="for_sale",
- ),
- scrape_property(location="Florida", site_name="redfin", listing_type="for_rent"),
- ]
- except (InvalidSite, InvalidListingType, NoResultsFound, GeoCoordsNotFound, SearchTooBroad):
- assert True
-
- assert all([result is None for result in bad_results])
diff --git a/tests/test_utils.py b/tests/test_utils.py
deleted file mode 100644
index d21ee77..0000000
--- a/tests/test_utils.py
+++ /dev/null
@@ -1,24 +0,0 @@
-from homeharvest.utils import parse_address_one, parse_address_two
-
-
-def test_parse_address_one():
- test_data = [
- ("4303 E Cactus Rd Apt 126", ("4303 E Cactus Rd", "#126")),
- ("1234 Elm Street apt 2B", ("1234 Elm Street", "#2B")),
- ("1234 Elm Street UNIT 3A", ("1234 Elm Street", "#3A")),
- ("1234 Elm Street unit 3A", ("1234 Elm Street", "#3A")),
- ("1234 Elm Street SuIte 3A", ("1234 Elm Street", "#3A")),
- ]
-
- for input_data, (exp_addr_one, exp_addr_two) in test_data:
- address_one, address_two = parse_address_one(input_data)
- assert address_one == exp_addr_one
- assert address_two == exp_addr_two
-
-
-def test_parse_address_two():
- test_data = [("Apt 126", "#126"), ("apt 2B", "#2B"), ("UNIT 3A", "#3A"), ("unit 3A", "#3A"), ("SuIte 3A", "#3A")]
-
- for input_data, expected in test_data:
- output = parse_address_two(input_data)
- assert output == expected
diff --git a/tests/test_zillow.py b/tests/test_zillow.py
deleted file mode 100644
index dfcc55d..0000000
--- a/tests/test_zillow.py
+++ /dev/null
@@ -1,34 +0,0 @@
-from homeharvest import scrape_property
-from homeharvest.exceptions import (
- InvalidSite,
- InvalidListingType,
- NoResultsFound,
- GeoCoordsNotFound,
-)
-
-
-def test_zillow():
- results = [
- scrape_property(location="2530 Al Lipscomb Way", site_name="zillow", listing_type="for_sale"),
- scrape_property(location="Phoenix, AZ, USA", site_name=["zillow"], listing_type="for_rent"),
- scrape_property(location="Surprise, AZ", site_name=["zillow"], listing_type="for_sale"),
- scrape_property(location="Dallas, TX, USA", site_name="zillow", listing_type="sold"),
- scrape_property(location="85281", site_name="zillow"),
- scrape_property(location="3268 88th st s, Lakewood", site_name="zillow", listing_type="for_rent"),
- ]
-
- assert all([result is not None for result in results])
-
- bad_results = []
- try:
- bad_results += [
- scrape_property(
- location="abceefg ju098ot498hh9",
- site_name="zillow",
- listing_type="for_sale",
- )
- ]
- except (InvalidSite, InvalidListingType, NoResultsFound, GeoCoordsNotFound):
- assert True
-
- assert all([result is None for result in bad_results])