commit
4a1116440d
202
README.md
202
README.md
|
@ -1,6 +1,6 @@
|
||||||
<img src="https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/d1a2bf8b-09f5-4c57-b33a-0ada8a34f12d" width="400">
|
<img src="https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/d1a2bf8b-09f5-4c57-b33a-0ada8a34f12d" width="400">
|
||||||
|
|
||||||
**HomeHarvest** is a simple, yet comprehensive, real estate scraping library.
|
**HomeHarvest** is a simple, yet comprehensive, real estate scraping library that extracts and formats data in the style of MLS listings.
|
||||||
|
|
||||||
[![Try with Replit](https://replit.com/badge?caption=Try%20with%20Replit)](https://replit.com/@ZacharyHampton/HomeHarvestDemo)
|
[![Try with Replit](https://replit.com/badge?caption=Try%20with%20Replit)](https://replit.com/@ZacharyHampton/HomeHarvestDemo)
|
||||||
|
|
||||||
|
@ -11,10 +11,14 @@
|
||||||
|
|
||||||
Check out another project we wrote: ***[JobSpy](https://github.com/cullenwatson/JobSpy)** – a Python package for job scraping*
|
Check out another project we wrote: ***[JobSpy](https://github.com/cullenwatson/JobSpy)** – a Python package for job scraping*
|
||||||
|
|
||||||
## Features
|
## HomeHarvest Features
|
||||||
|
|
||||||
- Scrapes properties from **Zillow**, **Realtor.com** & **Redfin** simultaneously
|
- **Source**: Fetches properties directly from **Realtor.com**.
|
||||||
- Aggregates the properties in a Pandas DataFrame
|
- **Data Format**: Structures data to resemble MLS listings.
|
||||||
|
- **Export Flexibility**: Options to save as either CSV or Excel.
|
||||||
|
- **Usage Modes**:
|
||||||
|
- **CLI**: For users who prefer command-line operations.
|
||||||
|
- **Python**: For those who'd like to integrate scraping into their Python scripts.
|
||||||
|
|
||||||
[Video Guide for HomeHarvest](https://youtu.be/JnV7eR2Ve2o) - _updated for release v0.2.7_
|
[Video Guide for HomeHarvest](https://youtu.be/JnV7eR2Ve2o) - _updated for release v0.2.7_
|
||||||
|
|
||||||
|
@ -31,136 +35,150 @@ pip install homeharvest
|
||||||
|
|
||||||
### CLI
|
### CLI
|
||||||
|
|
||||||
|
```
|
||||||
|
usage: homeharvest [-l {for_sale,for_rent,sold}] [-o {excel,csv}] [-f FILENAME] [-p PROXY] [-d DAYS] [-r RADIUS] [-m] location
|
||||||
|
|
||||||
|
Home Harvest Property Scraper
|
||||||
|
|
||||||
|
positional arguments:
|
||||||
|
location Location to scrape (e.g., San Francisco, CA)
|
||||||
|
|
||||||
|
options:
|
||||||
|
-l {for_sale,for_rent,sold}, --listing_type {for_sale,for_rent,sold}
|
||||||
|
Listing type to scrape
|
||||||
|
-o {excel,csv}, --output {excel,csv}
|
||||||
|
Output format
|
||||||
|
-f FILENAME, --filename FILENAME
|
||||||
|
Name of the output file (without extension)
|
||||||
|
-p PROXY, --proxy PROXY
|
||||||
|
Proxy to use for scraping
|
||||||
|
-d DAYS, --days DAYS Sold/listed in last _ days filter.
|
||||||
|
-r RADIUS, --radius RADIUS
|
||||||
|
Get comparable properties within _ (eg. 0.0) miles. Only applicable for individual addresses.
|
||||||
|
-m, --mls_only If set, fetches only MLS listings.
|
||||||
|
```
|
||||||
```bash
|
```bash
|
||||||
homeharvest "San Francisco, CA" -s zillow realtor.com redfin -l for_rent -o excel -f HomeHarvest
|
> homeharvest "San Francisco, CA" -l for_rent -o excel -f HomeHarvest
|
||||||
```
|
```
|
||||||
|
|
||||||
This will scrape properties from the specified sites for the given location and listing type, and save the results to an Excel file named `HomeHarvest.xlsx`.
|
|
||||||
|
|
||||||
By default:
|
|
||||||
- If `-s` or `--site_name` is not provided, it will scrape from all available sites.
|
|
||||||
- If `-l` or `--listing_type` is left blank, the default is `for_sale`. Other options are `for_rent` or `sold`.
|
|
||||||
- The `-o` or `--output` default format is `excel`. Options are `csv` or `excel`.
|
|
||||||
- If `-f` or `--filename` is left blank, the default is `HomeHarvest_<current_timestamp>`.
|
|
||||||
- If `-p` or `--proxy` is not provided, the scraper uses the local IP.
|
|
||||||
- Use `-k` or `--keep_duplicates` to keep duplicate properties based on address. If not provided, duplicates will be removed.
|
|
||||||
### Python
|
### Python
|
||||||
|
|
||||||
```py
|
```py
|
||||||
from homeharvest import scrape_property
|
from homeharvest import scrape_property
|
||||||
import pandas as pd
|
from datetime import datetime
|
||||||
|
|
||||||
properties: pd.DataFrame = scrape_property(
|
# Generate filename based on current timestamp
|
||||||
site_name=["zillow", "realtor.com", "redfin"],
|
current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||||
location="85281",
|
filename = f"output/{current_timestamp}.csv"
|
||||||
listing_type="for_rent" # for_sale / sold
|
|
||||||
|
properties = scrape_property(
|
||||||
|
location="San Diego, CA",
|
||||||
|
listing_type="sold", # or (for_sale, for_rent)
|
||||||
|
    property_younger_than=30,  # sold in the last 30 days; for for_sale/for_rent, listed in the last 30 days
|
||||||
|
mls_only=True, # only fetch MLS listings
|
||||||
)
|
)
|
||||||
|
print(f"Number of properties: {len(properties)}")
|
||||||
|
|
||||||
#: Note, to export to CSV or Excel, use properties.to_csv() or properties.to_excel().
|
# Export to csv
|
||||||
print(properties)
|
properties.to_csv(filename, index=False)
|
||||||
|
print(properties.head())
|
||||||
```
|
```
|
||||||
|
|
||||||
## Output
|
## Output
|
||||||
```py
|
|
||||||
>>> properties.head()
|
|
||||||
property_url site_name listing_type apt_min_price apt_max_price ...
|
|
||||||
0 https://www.redfin.com/AZ/Tempe/1003-W-Washing... redfin for_rent 1666.0 2750.0 ...
|
|
||||||
1 https://www.redfin.com/AZ/Tempe/VELA-at-Town-L... redfin for_rent 1665.0 3763.0 ...
|
|
||||||
2 https://www.redfin.com/AZ/Tempe/Camden-Tempe/a... redfin for_rent 1939.0 3109.0 ...
|
|
||||||
3 https://www.redfin.com/AZ/Tempe/Emerson-Park/a... redfin for_rent 1185.0 1817.0 ...
|
|
||||||
4 https://www.redfin.com/AZ/Tempe/Rio-Paradiso-A... redfin for_rent 1470.0 2235.0 ...
|
|
||||||
[5 rows x 41 columns]
|
|
||||||
```
|
|
||||||
|
|
||||||
### Parameters for `scrape_properties()`
|
|
||||||
```plaintext
|
```plaintext
|
||||||
Required
|
>>> properties.head()
|
||||||
├── location (str): address in various formats e.g. just zip, full address, city/state, etc.
|
MLS MLS # Status Style ... COEDate LotSFApx PrcSqft Stories
|
||||||
└── listing_type (enum): for_rent, for_sale, sold
|
0 SDCA 230018348 SOLD CONDOS ... 2023-10-03 290110 803 2
|
||||||
Optional
|
1 SDCA 230016614 SOLD TOWNHOMES ... 2023-10-03 None 838 3
|
||||||
├── site_name (list[enum], default=all three sites): zillow, realtor.com, redfin
|
2 SDCA 230016367 SOLD CONDOS ... 2023-10-03 30056 649 1
|
||||||
├── proxy (str): in format 'http://user:pass@host:port' or [https, socks]
|
3 MRCA NDP2306335 SOLD SINGLE_FAMILY ... 2023-10-03 7519 661 2
|
||||||
└── keep_duplicates (bool, default=False): whether to keep or remove duplicate properties based on address
|
4 SDCA 230014532 SOLD CONDOS ... 2023-10-03 None 752 1
|
||||||
|
[5 rows x 22 columns]
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Parameters for `scrape_property()`
|
||||||
|
```
|
||||||
|
Required
|
||||||
|
├── location (str): The address in various formats - this could be just a zip code, a full address, or city/state, etc.
|
||||||
|
└── listing_type (option): Choose the type of listing.
|
||||||
|
- 'for_rent'
|
||||||
|
- 'for_sale'
|
||||||
|
- 'sold'
|
||||||
|
|
||||||
|
Optional
|
||||||
|
├── radius (decimal): Radius in miles to find comparable properties based on individual addresses.
|
||||||
|
│ Example: 5.5 (fetches properties within a 5.5-mile radius if location is set to a specific address; otherwise, ignored)
|
||||||
|
│
|
||||||
|
├── property_younger_than (integer): Number of past days to filter properties. Utilizes 'last_sold_date' for 'sold' listing types, and 'list_date' for others (for_rent, for_sale).
|
||||||
|
│ Example: 30 (fetches properties listed/sold in the last 30 days)
|
||||||
|
│
|
||||||
|
├── mls_only (True/False): If set, fetches only MLS listings (mainly applicable to 'sold' listings)
|
||||||
|
│
|
||||||
|
└── proxy (string): In format 'http://user:pass@host:port'
|
||||||
|
|
||||||
|
```
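
The `property_younger_than` description above can be illustrated with a small standalone sketch. It only mirrors the documented behaviour (use `last_sold_date` for `sold`, `list_date` otherwise) on plain dictionaries. The helper names `date_field_for` and `is_recent` are hypothetical and are not part of HomeHarvest, and ISO-formatted date strings are an assumption.

```python
from datetime import date, timedelta
from typing import Optional


def date_field_for(listing_type: str) -> str:
    """Pick the date field the filter is documented to use (hypothetical helper)."""
    return "last_sold_date" if listing_type == "sold" else "list_date"


def is_recent(record: dict, listing_type: str, days: int,
              today: Optional[date] = None) -> bool:
    """Return True if the record's relevant date falls within the last `days` days.

    Assumes dates are ISO strings such as "2023-10-03"; records without the
    relevant date are treated as not recent.
    """
    today = today or date.today()
    raw = record.get(date_field_for(listing_type))
    if raw is None:
        return False
    return date.fromisoformat(raw) >= today - timedelta(days=days)


if __name__ == "__main__":
    row = {"last_sold_date": "2023-10-03", "list_date": "2023-01-01"}
    # The same row counts as recent for "sold" but not for "for_sale".
    print(is_recent(row, "sold", 30, today=date(2023, 10, 10)))
    print(is_recent(row, "for_sale", 30, today=date(2023, 10, 10)))
```

This is only a reading aid for the parameter's meaning; how the library applies the filter internally is not shown in this README.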
|
||||||
### Property Schema
|
### Property Schema
|
||||||
```plaintext
|
```plaintext
|
||||||
Property
|
Property
|
||||||
├── Basic Information:
|
├── Basic Information:
|
||||||
│ ├── property_url (str)
|
│ ├── property_url
|
||||||
│ ├── site_name (enum): zillow, redfin, realtor.com
|
│ ├── mls
|
||||||
│ ├── listing_type (enum): for_sale, for_rent, sold
|
│ ├── mls_id
|
||||||
│ └── property_type (enum): house, apartment, condo, townhouse, single_family, multi_family, building
|
│ └── status
|
||||||
|
|
||||||
├── Address Details:
|
├── Address Details:
|
||||||
│ ├── street_address (str)
|
│ ├── street
|
||||||
│ ├── city (str)
|
│ ├── unit
|
||||||
│ ├── state (str)
|
│ ├── city
|
||||||
│ ├── zip_code (str)
|
│ ├── state
|
||||||
│ ├── unit (str)
|
│ └── zip_code
|
||||||
│ └── country (str)
|
|
||||||
|
|
||||||
├── House for Sale Features:
|
├── Property Description:
|
||||||
│ ├── tax_assessed_value (int)
|
│ ├── style
|
||||||
│ ├── lot_area_value (float)
|
│ ├── beds
|
||||||
│ ├── lot_area_unit (str)
|
│ ├── full_baths
|
||||||
│ ├── stories (int)
|
│ ├── half_baths
|
||||||
│ ├── year_built (int)
|
│ ├── sqft
|
||||||
│ └── price_per_sqft (int)
|
│ ├── year_built
|
||||||
|
│ ├── stories
|
||||||
|
│ └── lot_sqft
|
||||||
|
|
||||||
├── Building for Sale and Apartment Details:
|
├── Property Listing Details:
|
||||||
│ ├── bldg_name (str)
|
│ ├── list_price
|
||||||
│ ├── beds_min (int)
|
│ ├── list_date
|
||||||
│ ├── beds_max (int)
|
│ ├── sold_price
|
||||||
│ ├── baths_min (float)
|
│ ├── last_sold_date
|
||||||
│ ├── baths_max (float)
|
│ ├── price_per_sqft
|
||||||
│ ├── sqft_min (int)
|
│ └── hoa_fee
|
||||||
│ ├── sqft_max (int)
|
|
||||||
│ ├── price_min (int)
|
|
||||||
│ ├── price_max (int)
|
|
||||||
│ ├── area_min (int)
|
|
||||||
│ └── unit_count (int)
|
|
||||||
|
|
||||||
├── Miscellaneous Details:
|
├── Location Details:
|
||||||
│ ├── mls_id (str)
|
│ ├── latitude
|
||||||
│ ├── agent_name (str)
|
│ ├── longitude
|
||||||
│ ├── img_src (str)
|
|
||||||
│ ├── description (str)
|
|
||||||
│ ├── status_text (str)
|
|
||||||
│ └── posted_time (str)
|
|
||||||
|
|
||||||
└── Location Details:
|
└── Parking Details:
|
||||||
├── latitude (float)
|
└── parking_garage
|
||||||
└── longitude (float)
|
|
||||||
```
|
```
|
||||||
## Supported Countries for Property Scraping
|
|
||||||
|
|
||||||
* **Zillow**: contains listings in the **US** & **Canada**
|
|
||||||
* **Realtor.com**: mainly from the **US** but also has international listings
|
|
||||||
* **Redfin**: listings mainly in the **US**, **Canada**, & has expanded to some areas in **Mexico**
|
|
||||||
|
|
||||||
### Exceptions
|
### Exceptions
|
||||||
The following exceptions may be raised when using HomeHarvest:
|
The following exceptions may be raised when using HomeHarvest:
|
||||||
|
|
||||||
- `InvalidSite` - valid options: `zillow`, `redfin`, `realtor.com`
|
|
||||||
- `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold`
|
- `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold`
|
||||||
- `NoResultsFound` - no properties found from your input
|
- `NoResultsFound` - no properties found from your search
|
||||||
- `GeoCoordsNotFound` - if Zillow scraper is not able to derive geo-coordinates from the location you input
|
|
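
One way to avoid `InvalidListingType` is to normalize the value on the caller's side before passing it to `scrape_property()`. The sketch below is a minimal example under that assumption. `normalize_listing_type` is a hypothetical helper that raises a plain `ValueError`, not the library's own exception.

```python
# Valid options as listed for InvalidListingType in this README.
VALID_LISTING_TYPES = ("for_sale", "for_rent", "sold")


def normalize_listing_type(value: str) -> str:
    """Lower-case and trim a listing type, rejecting unknown values early.

    Hypothetical caller-side check; it does not replace the library's validation.
    """
    cleaned = value.strip().lower()
    if cleaned not in VALID_LISTING_TYPES:
        options = ", ".join(VALID_LISTING_TYPES)
        raise ValueError(f"unknown listing type {value!r}; expected one of: {options}")
    return cleaned


if __name__ == "__main__":
    print(normalize_listing_type(" Sold "))
```

`NoResultsFound` cannot be prevented this way. It still needs a `try`/`except` around the call, typically followed by a broader search.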
||||||
|
|
||||||
## Frequently Asked Questions
|
## Frequently Asked Questions
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Q: Encountering issues with your queries?**
|
**Q: Encountering issues with your searches?**
|
||||||
**A:** Try a single site and/or broaden the location. If problems persist, [submit an issue](https://github.com/ZacharyHampton/HomeHarvest/issues).
|
**A:** Try broadening your search parameters. If problems persist, [submit an issue](https://github.com/ZacharyHampton/HomeHarvest/issues).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Q: Received a Forbidden 403 response code?**
|
**Q: Received a Forbidden 403 response code?**
|
||||||
**A:** This indicates that you have been blocked by the real estate site for sending too many requests. Currently, **Zillow** is particularly aggressive with blocking. We recommend:
|
**A:** This indicates that you have been blocked by Realtor.com for sending too many requests. We recommend:
|
||||||
|
|
||||||
- Waiting a few seconds between requests.
|
- Waiting a few seconds between requests.
|
||||||
- Trying a VPN to change your IP address.
|
- Trying a VPN or passing a proxy as a parameter to `scrape_property()` to change your IP address.
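
The two recommendations above (wait between requests, change IP through a proxy) can be combined in a small retry wrapper. This is a minimal sketch, not part of HomeHarvest: `call_with_retries` is a hypothetical helper, the delays are arbitrary, and which exception a 403 surfaces as is not documented here, so the sketch catches `Exception` broadly.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_retries(
    fetch: Callable[[Optional[str]], T],
    proxies: Sequence[Optional[str]] = (None,),
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(proxy) until it succeeds, doubling the wait after each failure.

    Proxies are used in rotation; None means "no proxy" (the local IP).
    """
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        proxy = proxies[attempt % len(proxies)]
        try:
            return fetch(proxy)
        except Exception as exc:  # assumption: narrow this to the error you actually see
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"all {attempts} attempts failed") from last_error


# Usage sketch (requires homeharvest to be installed; the proxy URL is a placeholder):
#
# from homeharvest import scrape_property
# properties = call_with_retries(
#     lambda proxy: scrape_property(location="San Diego, CA", listing_type="sold", proxy=proxy),
#     proxies=[None, "http://user:pass@host:port"],
# )
```

Passing the scrape call in as a function keeps the wrapper independent of the library, so it can be exercised with a stub before any real requests are sent.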
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
|
@ -31,7 +31,7 @@
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"# scrapes all 3 sites by default\n",
|
"# check for sale properties\n",
|
||||||
"scrape_property(\n",
|
"scrape_property(\n",
|
||||||
" location=\"dallas\",\n",
|
" location=\"dallas\",\n",
|
||||||
" listing_type=\"for_sale\"\n",
|
" listing_type=\"for_sale\"\n",
|
||||||
|
@ -53,7 +53,6 @@
|
||||||
"# search a specific address\n",
|
"# search a specific address\n",
|
||||||
"scrape_property(\n",
|
"scrape_property(\n",
|
||||||
" location=\"2530 Al Lipscomb Way\",\n",
|
" location=\"2530 Al Lipscomb Way\",\n",
|
||||||
" site_name=\"zillow\",\n",
|
|
||||||
" listing_type=\"for_sale\"\n",
|
" listing_type=\"for_sale\"\n",
|
||||||
")"
|
")"
|
||||||
]
|
]
|
||||||
|
@ -68,7 +67,6 @@
|
||||||
"# check rentals\n",
|
"# check rentals\n",
|
||||||
"scrape_property(\n",
|
"scrape_property(\n",
|
||||||
" location=\"chicago, illinois\",\n",
|
" location=\"chicago, illinois\",\n",
|
||||||
" site_name=[\"redfin\", \"zillow\"],\n",
|
|
||||||
" listing_type=\"for_rent\"\n",
|
" listing_type=\"for_rent\"\n",
|
||||||
")"
|
")"
|
||||||
]
|
]
|
||||||
|
@ -88,7 +86,6 @@
|
||||||
"# check sold properties\n",
|
"# check sold properties\n",
|
||||||
"scrape_property(\n",
|
"scrape_property(\n",
|
||||||
" location=\"90210\",\n",
|
" location=\"90210\",\n",
|
||||||
" site_name=[\"redfin\"],\n",
|
|
||||||
" listing_type=\"sold\"\n",
|
" listing_type=\"sold\"\n",
|
||||||
")"
|
")"
|
||||||
]
|
]
|
|
@ -0,0 +1,18 @@
|
||||||
|
from homeharvest import scrape_property
|
||||||
|
from datetime import datetime
|
||||||
|
|
||||||
|
# Generate filename based on current timestamp
|
||||||
|
current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||||
|
filename = f"output/{current_timestamp}.csv"
|
||||||
|
|
||||||
|
properties = scrape_property(
|
||||||
|
location="San Diego, CA",
|
||||||
|
listing_type="sold", # for_sale, for_rent
|
||||||
|
property_younger_than=30, # sold/listed in last 30 days
|
||||||
|
mls_only=True, # only fetch MLS listings
|
||||||
|
)
|
||||||
|
print(f"Number of properties: {len(properties)}")
|
||||||
|
|
||||||
|
# Export to csv
|
||||||
|
properties.to_csv(filename, index=False)
|
||||||
|
print(properties.head())
|
|
@ -1,187 +1,50 @@
|
||||||
|
import warnings
|
||||||
import pandas as pd
|
import pandas as pd
|
||||||
from typing import Union
|
|
||||||
import concurrent.futures
|
|
||||||
from concurrent.futures import ThreadPoolExecutor
|
|
||||||
|
|
||||||
from .core.scrapers import ScraperInput
|
from .core.scrapers import ScraperInput
|
||||||
from .core.scrapers.redfin import RedfinScraper
|
from .utils import process_result, ordered_properties, validate_input
|
||||||
from .core.scrapers.realtor import RealtorScraper
|
from .core.scrapers.realtor import RealtorScraper
|
||||||
from .core.scrapers.zillow import ZillowScraper
|
from .core.scrapers.models import ListingType
|
||||||
from .core.scrapers.models import ListingType, Property, SiteName
|
from .exceptions import InvalidListingType, NoResultsFound
|
||||||
from .exceptions import InvalidSite, InvalidListingType
|
|
||||||
|
|
||||||
|
|
||||||
_scrapers = {
|
|
||||||
"redfin": RedfinScraper,
|
|
||||||
"realtor.com": RealtorScraper,
|
|
||||||
"zillow": ZillowScraper,
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
def _validate_input(site_name: str, listing_type: str) -> None:
|
|
||||||
if site_name.lower() not in _scrapers:
|
|
||||||
raise InvalidSite(f"Provided site, '{site_name}', does not exist.")
|
|
||||||
|
|
||||||
if listing_type.upper() not in ListingType.__members__:
|
|
||||||
raise InvalidListingType(f"Provided listing type, '{listing_type}', does not exist.")
|
|
||||||
|
|
||||||
|
|
||||||
def _get_ordered_properties(result: Property) -> list[str]:
|
|
||||||
return [
|
|
||||||
"property_url",
|
|
||||||
"site_name",
|
|
||||||
"listing_type",
|
|
||||||
"property_type",
|
|
||||||
"status_text",
|
|
||||||
"baths_min",
|
|
||||||
"baths_max",
|
|
||||||
"beds_min",
|
|
||||||
"beds_max",
|
|
||||||
"sqft_min",
|
|
||||||
"sqft_max",
|
|
||||||
"price_min",
|
|
||||||
"price_max",
|
|
||||||
"unit_count",
|
|
||||||
"tax_assessed_value",
|
|
||||||
"price_per_sqft",
|
|
||||||
"lot_area_value",
|
|
||||||
"lot_area_unit",
|
|
||||||
"address_one",
|
|
||||||
"address_two",
|
|
||||||
"city",
|
|
||||||
"state",
|
|
||||||
"zip_code",
|
|
||||||
"posted_time",
|
|
||||||
"area_min",
|
|
||||||
"bldg_name",
|
|
||||||
"stories",
|
|
||||||
"year_built",
|
|
||||||
"agent_name",
|
|
||||||
"agent_phone",
|
|
||||||
"agent_email",
|
|
||||||
"days_on_market",
|
|
||||||
"sold_date",
|
|
||||||
"mls_id",
|
|
||||||
"img_src",
|
|
||||||
"latitude",
|
|
||||||
"longitude",
|
|
||||||
"description",
|
|
||||||
]
|
|
||||||
|
|
||||||
|
|
||||||
def _process_result(result: Property) -> pd.DataFrame:
|
|
||||||
prop_data = result.__dict__
|
|
||||||
|
|
||||||
prop_data["site_name"] = prop_data["site_name"].value
|
|
||||||
prop_data["listing_type"] = prop_data["listing_type"].value.lower()
|
|
||||||
if "property_type" in prop_data and prop_data["property_type"] is not None:
|
|
||||||
prop_data["property_type"] = prop_data["property_type"].value.lower()
|
|
||||||
else:
|
|
||||||
prop_data["property_type"] = None
|
|
||||||
if "address" in prop_data:
|
|
||||||
address_data = prop_data["address"]
|
|
||||||
prop_data["address_one"] = address_data.address_one
|
|
||||||
prop_data["address_two"] = address_data.address_two
|
|
||||||
prop_data["city"] = address_data.city
|
|
||||||
prop_data["state"] = address_data.state
|
|
||||||
prop_data["zip_code"] = address_data.zip_code
|
|
||||||
|
|
||||||
del prop_data["address"]
|
|
||||||
|
|
||||||
if "agent" in prop_data and prop_data["agent"] is not None:
|
|
||||||
agent_data = prop_data["agent"]
|
|
||||||
prop_data["agent_name"] = agent_data.name
|
|
||||||
prop_data["agent_phone"] = agent_data.phone
|
|
||||||
prop_data["agent_email"] = agent_data.email
|
|
||||||
|
|
||||||
del prop_data["agent"]
|
|
||||||
else:
|
|
||||||
prop_data["agent_name"] = None
|
|
||||||
prop_data["agent_phone"] = None
|
|
||||||
prop_data["agent_email"] = None
|
|
||||||
|
|
||||||
properties_df = pd.DataFrame([prop_data])
|
|
||||||
properties_df = properties_df[_get_ordered_properties(result)]
|
|
||||||
|
|
||||||
return properties_df
|
|
||||||
|
|
||||||
|
|
||||||
def _scrape_single_site(location: str, site_name: str, listing_type: str, proxy: str = None) -> pd.DataFrame:
|
|
||||||
"""
|
|
||||||
Helper function to scrape a single site.
|
|
||||||
"""
|
|
||||||
_validate_input(site_name, listing_type)
|
|
||||||
|
|
||||||
scraper_input = ScraperInput(
|
|
||||||
location=location,
|
|
||||||
listing_type=ListingType[listing_type.upper()],
|
|
||||||
site_name=SiteName.get_by_value(site_name.lower()),
|
|
||||||
proxy=proxy,
|
|
||||||
)
|
|
||||||
|
|
||||||
site = _scrapers[site_name.lower()](scraper_input)
|
|
||||||
results = site.search()
|
|
||||||
|
|
||||||
properties_dfs = [_process_result(result) for result in results]
|
|
||||||
properties_dfs = [df.dropna(axis=1, how="all") for df in properties_dfs if not df.empty]
|
|
||||||
if not properties_dfs:
|
|
||||||
return pd.DataFrame()
|
|
||||||
|
|
||||||
return pd.concat(properties_dfs, ignore_index=True)
|
|
||||||
|
|
||||||
|
|
||||||
def scrape_property(
|
def scrape_property(
|
||||||
location: str,
|
location: str,
|
||||||
site_name: Union[str, list[str]] = None,
|
|
||||||
listing_type: str = "for_sale",
|
listing_type: str = "for_sale",
|
||||||
|
radius: float = None,
|
||||||
|
mls_only: bool = False,
|
||||||
|
property_younger_than: int = None,
|
||||||
|
pending_or_contingent: bool = False,
|
||||||
proxy: str = None,
|
proxy: str = None,
|
||||||
keep_duplicates: bool = False
|
|
||||||
) -> pd.DataFrame:
|
) -> pd.DataFrame:
|
||||||
"""
|
"""
|
||||||
Scrape property from various sites from a given location and listing type.
|
Scrape properties from Realtor.com based on a given location and listing type.
|
||||||
|
:param location: Location to search (e.g. "Dallas, TX", "85281", "2530 Al Lipscomb Way")
|
||||||
:returns: pd.DataFrame
|
:param listing_type: Listing Type (for_sale, for_rent, sold)
|
||||||
:param location: US Location (e.g. 'San Francisco, CA', 'Cook County, IL', '85281', '2530 Al Lipscomb Way')
|
:param radius: Get properties within _ (e.g. 1.0) miles. Only applicable for individual addresses.
|
||||||
:param site_name: Site name or list of site names (e.g. ['realtor.com', 'zillow'], 'redfin')
|
:param mls_only: If set, fetches only listings with MLS IDs.
|
||||||
:param listing_type: Listing type (e.g. 'for_sale', 'for_rent', 'sold')
|
:param property_younger_than: Get properties sold/listed in last _ days.
|
||||||
:return: pd.DataFrame containing properties
|
:param pending_or_contingent: If set, fetches only pending or contingent listings. Only applicable for for_sale listings from general area searches.
|
||||||
|
:param proxy: Proxy to use for scraping
|
||||||
"""
|
"""
|
||||||
if site_name is None:
|
validate_input(listing_type)
|
||||||
site_name = list(_scrapers.keys())
|
|
||||||
|
|
||||||
if not isinstance(site_name, list):
|
scraper_input = ScraperInput(
|
||||||
site_name = [site_name]
|
location=location,
|
||||||
|
listing_type=ListingType[listing_type.upper()],
|
||||||
|
proxy=proxy,
|
||||||
|
radius=radius,
|
||||||
|
mls_only=mls_only,
|
||||||
|
last_x_days=property_younger_than,
|
||||||
|
pending_or_contingent=pending_or_contingent,
|
||||||
|
)
|
||||||
|
|
||||||
results = []
|
site = RealtorScraper(scraper_input)
|
||||||
|
results = site.search()
|
||||||
|
|
||||||
if len(site_name) == 1:
|
properties_dfs = [process_result(result) for result in results]
|
||||||
final_df = _scrape_single_site(location, site_name[0], listing_type, proxy)
|
if not properties_dfs:
|
||||||
results.append(final_df)
|
raise NoResultsFound("no results found for the query")
|
||||||
else:
|
|
||||||
with ThreadPoolExecutor() as executor:
|
|
||||||
futures = {
|
|
||||||
executor.submit(_scrape_single_site, location, s_name, listing_type, proxy): s_name
|
|
||||||
for s_name in site_name
|
|
||||||
}
|
|
||||||
|
|
||||||
for future in concurrent.futures.as_completed(futures):
|
with warnings.catch_warnings():
|
||||||
result = future.result()
|
warnings.simplefilter("ignore", category=FutureWarning)
|
||||||
results.append(result)
|
return pd.concat(properties_dfs, ignore_index=True, axis=0)[ordered_properties]
|
||||||
|
|
||||||
results = [df for df in results if not df.empty and not df.isna().all().all()]
|
|
||||||
|
|
||||||
if not results:
|
|
||||||
return pd.DataFrame()
|
|
||||||
|
|
||||||
final_df = pd.concat(results, ignore_index=True)
|
|
||||||
|
|
||||||
columns_to_track = ["address_one", "address_two", "city"]
|
|
||||||
|
|
||||||
#: validate they exist, otherwise create them
|
|
||||||
for col in columns_to_track:
|
|
||||||
if col not in final_df.columns:
|
|
||||||
final_df[col] = None
|
|
||||||
|
|
||||||
if not keep_duplicates:
|
|
||||||
final_df = final_df.drop_duplicates(subset=columns_to_track, keep="first")
|
|
||||||
return final_df
|
|
||||||
|
|
|
@ -5,15 +5,8 @@ from homeharvest import scrape_property
|
||||||
|
|
||||||
def main():
|
def main():
|
||||||
parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
|
parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
|
||||||
parser.add_argument("location", type=str, help="Location to scrape (e.g., San Francisco, CA)")
|
|
||||||
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"-s",
|
"location", type=str, help="Location to scrape (e.g., San Francisco, CA)"
|
||||||
"--site_name",
|
|
||||||
type=str,
|
|
||||||
nargs="*",
|
|
||||||
default=None,
|
|
||||||
help="Site name(s) to scrape from (e.g., realtor, zillow)",
|
|
||||||
)
|
)
|
||||||
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
|
@ -43,17 +36,40 @@ def main():
|
||||||
)
|
)
|
||||||
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"-k",
|
"-p", "--proxy", type=str, default=None, help="Proxy to use for scraping"
|
||||||
"--keep_duplicates",
|
)
|
||||||
action="store_true",
|
parser.add_argument(
|
||||||
help="Keep duplicate properties based on address"
|
"-d",
|
||||||
|
"--days",
|
||||||
|
type=int,
|
||||||
|
default=None,
|
||||||
|
help="Sold/listed in last _ days filter.",
|
||||||
)
|
)
|
||||||
|
|
||||||
parser.add_argument("-p", "--proxy", type=str, default=None, help="Proxy to use for scraping")
|
parser.add_argument(
|
||||||
|
"-r",
|
||||||
|
"--radius",
|
||||||
|
type=float,
|
||||||
|
default=None,
|
||||||
|
help="Get comparable properties within _ (eg. 0.0) miles. Only applicable for individual addresses.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"-m",
|
||||||
|
"--mls_only",
|
||||||
|
action="store_true",
|
||||||
|
help="If set, fetches only MLS listings.",
|
||||||
|
)
|
||||||
|
|
||||||
args = parser.parse_args()
|
args = parser.parse_args()
|
||||||
|
|
||||||
result = scrape_property(args.location, args.site_name, args.listing_type, proxy=args.proxy, keep_duplicates=args.keep_duplicates)
|
result = scrape_property(
|
||||||
|
args.location,
|
||||||
|
args.listing_type,
|
||||||
|
radius=args.radius,
|
||||||
|
proxy=args.proxy,
|
||||||
|
mls_only=args.mls_only,
|
||||||
|
property_younger_than=args.days,
|
||||||
|
)
|
||||||
|
|
||||||
if not args.filename:
|
if not args.filename:
|
||||||
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
|
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||||
|
|
|
@ -8,12 +8,19 @@ from .models import Property, ListingType, SiteName
|
||||||
class ScraperInput:
|
class ScraperInput:
|
||||||
location: str
|
location: str
|
||||||
listing_type: ListingType
|
listing_type: ListingType
|
||||||
site_name: SiteName
|
radius: float | None = None
|
||||||
|
mls_only: bool | None = None
|
||||||
proxy: str | None = None
|
proxy: str | None = None
|
||||||
|
last_x_days: int | None = None
|
||||||
|
pending_or_contingent: bool | None = None
|
||||||
|
|
||||||
|
|
||||||
class Scraper:
|
class Scraper:
|
||||||
def __init__(self, scraper_input: ScraperInput, session: requests.Session | tls_client.Session = None):
|
def __init__(
|
||||||
|
self,
|
||||||
|
scraper_input: ScraperInput,
|
||||||
|
session: requests.Session | tls_client.Session = None,
|
||||||
|
):
|
||||||
self.location = scraper_input.location
|
self.location = scraper_input.location
|
||||||
self.listing_type = scraper_input.listing_type
|
self.listing_type = scraper_input.listing_type
|
||||||
|
|
||||||
|
@ -28,7 +35,10 @@ class Scraper:
|
||||||
self.session.proxies.update(proxies)
|
self.session.proxies.update(proxies)
|
||||||
|
|
||||||
self.listing_type = scraper_input.listing_type
|
self.listing_type = scraper_input.listing_type
|
||||||
self.site_name = scraper_input.site_name
|
self.radius = scraper_input.radius
|
||||||
|
self.last_x_days = scraper_input.last_x_days
|
||||||
|
self.mls_only = scraper_input.mls_only
|
||||||
|
self.pending_or_contingent = scraper_input.pending_or_contingent
|
||||||
|
|
||||||
def search(self) -> list[Property]:
|
def search(self) -> list[Property]:
|
||||||
...
|
...
|
||||||
|
|
|
@ -1,7 +1,6 @@
|
||||||
from dataclasses import dataclass
|
from dataclasses import dataclass
|
||||||
from enum import Enum
|
from enum import Enum
|
||||||
from typing import Tuple
|
from typing import Optional
|
||||||
from datetime import datetime
|
|
||||||
|
|
||||||
|
|
||||||
class SiteName(Enum):
|
class SiteName(Enum):
|
||||||
|
@ -23,98 +22,44 @@ class ListingType(Enum):
|
||||||
SOLD = "SOLD"
|
SOLD = "SOLD"
|
||||||
|
|
||||||
|
|
||||||
class PropertyType(Enum):
|
|
||||||
HOUSE = "HOUSE"
|
|
||||||
BUILDING = "BUILDING"
|
|
||||||
CONDO = "CONDO"
|
|
||||||
TOWNHOUSE = "TOWNHOUSE"
|
|
||||||
SINGLE_FAMILY = "SINGLE_FAMILY"
|
|
||||||
MULTI_FAMILY = "MULTI_FAMILY"
|
|
||||||
MANUFACTURED = "MANUFACTURED"
|
|
||||||
NEW_CONSTRUCTION = "NEW_CONSTRUCTION"
|
|
||||||
APARTMENT = "APARTMENT"
|
|
||||||
APARTMENTS = "APARTMENTS"
|
|
||||||
LAND = "LAND"
|
|
||||||
LOT = "LOT"
|
|
||||||
OTHER = "OTHER"
|
|
||||||
|
|
||||||
BLANK = "BLANK"
|
|
||||||
|
|
||||||
@classmethod
|
|
||||||
def from_int_code(cls, code):
|
|
||||||
mapping = {
|
|
||||||
1: cls.HOUSE,
|
|
||||||
2: cls.CONDO,
|
|
||||||
3: cls.TOWNHOUSE,
|
|
||||||
4: cls.MULTI_FAMILY,
|
|
||||||
5: cls.LAND,
|
|
||||||
6: cls.OTHER,
|
|
||||||
8: cls.SINGLE_FAMILY,
|
|
||||||
13: cls.SINGLE_FAMILY,
|
|
||||||
}
|
|
||||||
|
|
||||||
return mapping.get(code, cls.BLANK)
|
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
class Address:
|
class Address:
|
||||||
address_one: str | None = None
|
street: str | None = None
|
||||||
address_two: str | None = "#"
|
unit: str | None = None
|
||||||
city: str | None = None
|
city: str | None = None
|
||||||
state: str | None = None
|
state: str | None = None
|
||||||
zip_code: str | None = None
|
zip: str | None = None
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
class Agent:
|
class Description:
|
||||||
name: str
|
style: str | None = None
|
||||||
phone: str | None = None
|
beds: int | None = None
|
||||||
email: str | None = None
|
baths_full: int | None = None
|
||||||
|
baths_half: int | None = None
|
||||||
|
sqft: int | None = None
|
||||||
|
lot_sqft: int | None = None
|
||||||
|
sold_price: int | None = None
|
||||||
|
year_built: int | None = None
|
||||||
|
garage: float | None = None
|
||||||
|
stories: int | None = None
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
class Property:
|
class Property:
|
||||||
property_url: str
|
property_url: str
|
||||||
site_name: SiteName
|
mls: str | None = None
|
||||||
listing_type: ListingType
|
|
||||||
address: Address
|
|
||||||
property_type: PropertyType | None = None
|
|
||||||
|
|
||||||
# house for sale
|
|
||||||
tax_assessed_value: int | None = None
|
|
||||||
lot_area_value: float | None = None
|
|
||||||
lot_area_unit: str | None = None
|
|
||||||
stories: int | None = None
|
|
||||||
year_built: int | None = None
|
|
||||||
price_per_sqft: int | None = None
|
|
||||||
mls_id: str | None = None
|
mls_id: str | None = None
|
||||||
|
status: str | None = None
|
||||||
|
address: Address | None = None
|
||||||
|
|
||||||
agent: Agent | None = None
|
list_price: int | None = None
|
||||||
img_src: str | None = None
|
list_date: str | None = None
|
||||||
description: str | None = None
|
last_sold_date: str | None = None
|
||||||
status_text: str | None = None
|
prc_sqft: int | None = None
|
||||||
posted_time: datetime | None = None
|
hoa_fee: int | None = None
|
||||||
|
description: Description | None = None
|
||||||
# building for sale
|
|
||||||
bldg_name: str | None = None
|
|
||||||
area_min: int | None = None
|
|
||||||
|
|
||||||
beds_min: int | None = None
|
|
||||||
beds_max: int | None = None
|
|
||||||
|
|
||||||
baths_min: float | None = None
|
|
||||||
baths_max: float | None = None
|
|
||||||
|
|
||||||
sqft_min: int | None = None
|
|
||||||
sqft_max: int | None = None
|
|
||||||
|
|
||||||
price_min: int | None = None
|
|
||||||
price_max: int | None = None
|
|
||||||
|
|
||||||
unit_count: int | None = None
|
|
||||||
|
|
||||||
latitude: float | None = None
|
latitude: float | None = None
|
||||||
longitude: float | None = None
|
longitude: float | None = None
|
||||||
|
neighborhoods: Optional[str] = None
|
||||||
sold_date: datetime | None = None
|
|
||||||
days_on_market: int | None = None
|
|
||||||
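The model change above moves per-home attributes (beds, baths, sqft, and so on) out of `Property` and into a nested `Description` dataclass. The following is a minimal sketch of how a caller might flatten that nested shape back into a single row, for example before building a DataFrame. It uses trimmed-down copies of the two dataclasses with only a few of the fields shown in the diff, and `flatten` is a hypothetical helper, not part of the commit.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass
class Description:
    # Subset of the fields introduced in the diff above.
    style: str | None = None
    beds: int | None = None
    baths_full: int | None = None
    sqft: int | None = None


@dataclass
class Property:
    # Subset of the new Property fields; the real class has more.
    property_url: str
    mls: str | None = None
    list_price: int | None = None
    description: Description | None = None


def flatten(prop: Property) -> dict:
    """Hypothetical helper: merge the nested Description into one flat row."""
    row = asdict(prop)  # asdict() also converts the nested dataclass to a dict
    nested = row.pop("description") or {}
    row.update(nested)
    return row


example = Property(
    property_url="https://www.realtor.com/realestateandhomes-detail/example",
    list_price=350000,
    description=Description(style="SINGLE_FAMILY", beds=3, sqft=1500),
)
flat_row = flatten(example)
```

Keeping the nested dataclass and flattening only at export time is one possible design; the commit itself may handle this differently elsewhere.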
|
|
|
@ -2,39 +2,26 @@
|
||||||
homeharvest.realtor.__init__
|
homeharvest.realtor.__init__
|
||||||
~~~~~~~~~~~~
|
~~~~~~~~~~~~
|
||||||
|
|
||||||
This module implements the scraper for relator.com
|
This module implements the scraper for realtor.com
|
||||||
"""
|
"""
|
||||||
from ..models import Property, Address
|
from typing import Dict, Union, Optional
|
||||||
|
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||||
|
|
||||||
from .. import Scraper
|
from .. import Scraper
|
||||||
from ....exceptions import NoResultsFound
|
from ....exceptions import NoResultsFound
|
||||||
from ....utils import parse_address_one, parse_address_two
|
from ..models import Property, Address, ListingType, Description
|
||||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
|
||||||
|
|
||||||
|
|
||||||
class RealtorScraper(Scraper):
|
class RealtorScraper(Scraper):
|
||||||
|
SEARCH_GQL_URL = "https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta"
|
||||||
|
PROPERTY_URL = "https://www.realtor.com/realestateandhomes-detail/"
|
||||||
|
ADDRESS_AUTOCOMPLETE_URL = "https://parser-external.geo.moveaws.com/suggest"
|
||||||
|
|
||||||
def __init__(self, scraper_input):
|
def __init__(self, scraper_input):
|
||||||
self.counter = 1
|
self.counter = 1
|
||||||
super().__init__(scraper_input)
|
super().__init__(scraper_input)
|
||||||
self.search_url = (
|
|
||||||
"https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta"
|
|
||||||
)
|
|
||||||
|
|
||||||
def handle_location(self):
|
def handle_location(self):
|
||||||
headers = {
|
|
||||||
"authority": "parser-external.geo.moveaws.com",
|
|
||||||
"accept": "*/*",
|
|
||||||
"accept-language": "en-US,en;q=0.9",
|
|
||||||
"origin": "https://www.realtor.com",
|
|
||||||
"referer": "https://www.realtor.com/",
|
|
||||||
"sec-ch-ua": '"Chromium";v="116", "Not)A;Brand";v="24", "Google Chrome";v="116"',
|
|
||||||
"sec-ch-ua-mobile": "?0",
|
|
||||||
"sec-ch-ua-platform": '"Windows"',
|
|
||||||
"sec-fetch-dest": "empty",
|
|
||||||
"sec-fetch-mode": "cors",
|
|
||||||
"sec-fetch-site": "cross-site",
|
|
||||||
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
|
|
||||||
}
|
|
||||||
|
|
||||||
params = {
|
params = {
|
||||||
"input": self.location,
|
"input": self.location,
|
||||||
"client_id": self.listing_type.value.lower().replace("_", "-"),
|
"client_id": self.listing_type.value.lower().replace("_", "-"),
|
||||||
|
@ -43,9 +30,8 @@ class RealtorScraper(Scraper):
|
||||||
}
|
}
|
||||||
|
|
||||||
response = self.session.get(
|
response = self.session.get(
|
||||||
"https://parser-external.geo.moveaws.com/suggest",
|
self.ADDRESS_AUTOCOMPLETE_URL,
|
||||||
params=params,
|
params=params,
|
||||||
headers=headers,
|
|
||||||
)
|
)
|
||||||
response_json = response.json()
|
response_json = response.json()
|
||||||
|
|
||||||
|
@ -56,6 +42,145 @@ class RealtorScraper(Scraper):
|
||||||
|
|
||||||
return result[0]
|
return result[0]
|
||||||
|
|
||||||
|
def handle_listing(self, listing_id: str) -> list[Property]:
|
||||||
|
query = """query Listing($listing_id: ID!) {
|
||||||
|
listing(id: $listing_id) {
|
||||||
|
source {
|
||||||
|
id
|
||||||
|
listing_id
|
||||||
|
}
|
||||||
|
address {
|
||||||
|
street_number
|
||||||
|
street_name
|
||||||
|
street_suffix
|
||||||
|
unit
|
||||||
|
city
|
||||||
|
state_code
|
||||||
|
postal_code
|
||||||
|
location {
|
||||||
|
coordinate {
|
||||||
|
lat
|
||||||
|
lon
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
basic {
|
||||||
|
sqft
|
||||||
|
beds
|
||||||
|
baths_full
|
||||||
|
baths_half
|
||||||
|
lot_sqft
|
||||||
|
sold_price
|
||||||
|
sold_price
|
||||||
|
type
|
||||||
|
price
|
||||||
|
status
|
||||||
|
sold_date
|
||||||
|
list_date
|
||||||
|
}
|
||||||
|
details {
|
||||||
|
year_built
|
||||||
|
stories
|
||||||
|
garage
|
||||||
|
permalink
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}"""
|
||||||
|
|
||||||
|
variables = {"listing_id": listing_id}
|
||||||
|
payload = {
|
||||||
|
"query": query,
|
||||||
|
"variables": variables,
|
||||||
|
}
|
||||||
|
|
||||||
|
response = self.session.post(self.SEARCH_GQL_URL, json=payload)
|
||||||
|
response_json = response.json()
|
||||||
|
|
||||||
|
property_info = response_json["data"]["listing"]
|
||||||
|
|
||||||
|
mls = (
|
||||||
|
property_info["source"].get("id")
|
||||||
|
if "source" in property_info and isinstance(property_info["source"], dict)
|
||||||
|
else None
|
||||||
|
)
|
||||||
|
|
||||||
|
able_to_get_lat_long = (
|
||||||
|
property_info
|
||||||
|
and property_info.get("address")
|
||||||
|
and property_info["address"].get("location")
|
||||||
|
and property_info["address"]["location"].get("coordinate")
|
||||||
|
)
|
||||||
|
|
||||||
|
listing = Property(
|
||||||
|
mls=mls,
|
||||||
|
mls_id=property_info["source"].get("listing_id")
|
||||||
|
if "source" in property_info and isinstance(property_info["source"], dict)
|
||||||
|
else None,
|
||||||
|
property_url=f"{self.PROPERTY_URL}{property_info['details']['permalink']}",
|
||||||
|
status=property_info["basic"]["status"].upper(),
|
||||||
|
list_price=property_info["basic"]["price"],
|
||||||
|
list_date=property_info["basic"]["list_date"].split("T")[0]
|
||||||
|
if property_info["basic"].get("list_date")
|
||||||
|
else None,
|
||||||
|
prc_sqft=property_info["basic"].get("price") / property_info["basic"].get("sqft")
|
||||||
|
if property_info["basic"].get("price") and property_info["basic"].get("sqft")
|
||||||
|
else None,
|
||||||
|
last_sold_date=property_info["basic"]["sold_date"].split("T")[0]
|
||||||
|
if property_info["basic"].get("sold_date")
|
||||||
|
else None,
|
||||||
|
latitude=property_info["address"]["location"]["coordinate"].get("lat")
|
||||||
|
if able_to_get_lat_long
|
||||||
|
else None,
|
||||||
|
longitude=property_info["address"]["location"]["coordinate"].get("lon")
|
||||||
|
if able_to_get_lat_long
|
||||||
|
else None,
|
||||||
|
address=self._parse_address(property_info, search_type="handle_listing"),
|
||||||
|
description=Description(
|
||||||
|
style=property_info["basic"].get("type", "").upper(),
|
||||||
|
beds=property_info["basic"].get("beds"),
|
||||||
|
baths_full=property_info["basic"].get("baths_full"),
|
||||||
|
baths_half=property_info["basic"].get("baths_half"),
|
||||||
|
sqft=property_info["basic"].get("sqft"),
|
||||||
|
lot_sqft=property_info["basic"].get("lot_sqft"),
|
||||||
|
sold_price=property_info["basic"].get("sold_price"),
|
||||||
|
year_built=property_info["details"].get("year_built"),
|
||||||
|
garage=property_info["details"].get("garage"),
|
||||||
|
stories=property_info["details"].get("stories"),
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
return [listing]
|
||||||
|
|
||||||
|
def get_latest_listing_id(self, property_id: str) -> str | None:
|
||||||
|
query = """query Property($property_id: ID!) {
|
||||||
|
property(id: $property_id) {
|
||||||
|
listings {
|
||||||
|
listing_id
|
||||||
|
primary
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
|
||||||
|
variables = {"property_id": property_id}
|
||||||
|
payload = {
|
||||||
|
"query": query,
|
||||||
|
"variables": variables,
|
||||||
|
}
|
||||||
|
|
||||||
|
response = self.session.post(self.SEARCH_GQL_URL, json=payload)
|
||||||
|
response_json = response.json()
|
||||||
|
|
||||||
|
property_info = response_json["data"]["property"]
|
||||||
|
if property_info["listings"] is None:
|
||||||
|
return None
|
||||||
|
|
||||||
|
primary_listing = next((listing for listing in property_info["listings"] if listing["primary"]), None)
|
||||||
|
if primary_listing:
|
||||||
|
return primary_listing["listing_id"]
|
||||||
|
else:
|
||||||
|
return property_info["listings"][0]["listing_id"]
|
||||||
|
|
||||||
def handle_address(self, property_id: str) -> list[Property]:
|
def handle_address(self, property_id: str) -> list[Property]:
|
||||||
"""
|
"""
|
||||||
Handles a specific address & returns one property
|
Handles a specific address & returns one property
|
||||||
|
@ -71,22 +196,19 @@ class RealtorScraper(Scraper):
|
||||||
stories
|
stories
|
||||||
}
|
}
|
||||||
address {
|
address {
|
||||||
address_validation_code
|
|
||||||
city
|
|
||||||
country
|
|
||||||
county
|
|
||||||
line
|
|
||||||
postal_code
|
|
||||||
state_code
|
|
||||||
street_direction
|
|
||||||
street_name
|
|
||||||
street_number
|
street_number
|
||||||
|
street_name
|
||||||
street_suffix
|
street_suffix
|
||||||
street_post_direction
|
|
||||||
unit_value
|
|
||||||
unit
|
unit
|
||||||
unit_descriptor
|
city
|
||||||
zip
|
state_code
|
||||||
|
postal_code
|
||||||
|
location {
|
||||||
|
coordinate {
|
||||||
|
lat
|
||||||
|
lon
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
basic {
|
basic {
|
||||||
baths
|
baths
|
||||||
|
@ -114,127 +236,175 @@ class RealtorScraper(Scraper):
|
||||||
"variables": variables,
|
"variables": variables,
|
||||||
}
|
}
|
||||||
|
|
||||||
response = self.session.post(self.search_url, json=payload)
|
response = self.session.post(self.SEARCH_GQL_URL, json=payload)
|
||||||
response_json = response.json()
|
response_json = response.json()
|
||||||
|
|
||||||
property_info = response_json["data"]["property"]
|
property_info = response_json["data"]["property"]
|
||||||
address_one, address_two = parse_address_one(property_info["address"]["line"])
|
|
||||||
|
|
||||||
return [
|
return [
|
||||||
Property(
|
Property(
|
||||||
site_name=self.site_name,
|
|
||||||
address=Address(
|
|
||||||
address_one=address_one,
|
|
||||||
address_two=address_two,
|
|
||||||
city=property_info["address"]["city"],
|
|
||||||
state=property_info["address"]["state_code"],
|
|
||||||
zip_code=property_info["address"]["postal_code"],
|
|
||||||
),
|
|
||||||
property_url="https://www.realtor.com/realestateandhomes-detail/"
|
|
||||||
+ property_info["details"]["permalink"],
|
|
||||||
stories=property_info["details"]["stories"],
|
|
||||||
year_built=property_info["details"]["year_built"],
|
|
||||||
price_per_sqft=property_info["basic"]["price"] // property_info["basic"]["sqft"]
|
|
||||||
if property_info["basic"]["sqft"] is not None and property_info["basic"]["price"] is not None
|
|
||||||
else None,
|
|
||||||
mls_id=property_id,
|
mls_id=property_id,
|
||||||
listing_type=self.listing_type,
|
property_url=f"{self.PROPERTY_URL}{property_info['details']['permalink']}",
|
||||||
lot_area_value=property_info["public_record"]["lot_size"]
|
address=self._parse_address(
|
||||||
if property_info["public_record"] is not None
|
property_info, search_type="handle_address"
|
||||||
else None,
|
),
|
||||||
beds_min=property_info["basic"]["beds"],
|
description=self._parse_description(property_info),
|
||||||
beds_max=property_info["basic"]["beds"],
|
|
||||||
baths_min=property_info["basic"]["baths"],
|
|
||||||
baths_max=property_info["basic"]["baths"],
|
|
||||||
sqft_min=property_info["basic"]["sqft"],
|
|
||||||
sqft_max=property_info["basic"]["sqft"],
|
|
||||||
price_min=property_info["basic"]["price"],
|
|
||||||
price_max=property_info["basic"]["price"],
|
|
||||||
)
|
)
|
||||||
]
|
]
|
||||||
|
|
||||||
def handle_area(self, variables: dict, return_total: bool = False) -> list[Property] | int:
|
def general_search(
|
||||||
|
self, variables: dict, search_type: str
|
||||||
|
) -> Dict[str, Union[int, list[Property]]]:
|
||||||
"""
|
"""
|
||||||
Handles a location area & returns a list of properties
|
Handles a location area & returns a list of properties
|
||||||
"""
|
"""
|
||||||
query = (
|
results_query = """{
|
||||||
"""query Home_search(
|
count
|
||||||
$city: String,
|
total
|
||||||
$county: [String],
|
results {
|
||||||
$state_code: String,
|
property_id
|
||||||
$postal_code: String
|
list_date
|
||||||
$offset: Int,
|
status
|
||||||
) {
|
last_sold_price
|
||||||
home_search(
|
last_sold_date
|
||||||
query: {
|
list_price
|
||||||
city: $city
|
price_per_sqft
|
||||||
county: $county
|
description {
|
||||||
postal_code: $postal_code
|
sqft
|
||||||
state_code: $state_code
|
beds
|
||||||
status: %s
|
baths_full
|
||||||
|
baths_half
|
||||||
|
lot_sqft
|
||||||
|
sold_price
|
||||||
|
year_built
|
||||||
|
garage
|
||||||
|
sold_price
|
||||||
|
type
|
||||||
|
name
|
||||||
|
stories
|
||||||
}
|
}
|
||||||
limit: 200
|
source {
|
||||||
offset: $offset
|
id
|
||||||
) {
|
listing_id
|
||||||
count
|
}
|
||||||
total
|
hoa {
|
||||||
results {
|
fee
|
||||||
property_id
|
}
|
||||||
description {
|
location {
|
||||||
baths
|
address {
|
||||||
beds
|
street_number
|
||||||
lot_sqft
|
street_name
|
||||||
sqft
|
street_suffix
|
||||||
text
|
unit
|
||||||
sold_price
|
city
|
||||||
stories
|
state_code
|
||||||
year_built
|
postal_code
|
||||||
garage
|
coordinate {
|
||||||
unit_number
|
lon
|
||||||
floor_number
|
lat
|
||||||
}
|
|
||||||
location {
|
|
||||||
address {
|
|
||||||
city
|
|
||||||
country
|
|
||||||
line
|
|
||||||
postal_code
|
|
||||||
state_code
|
|
||||||
state
|
|
||||||
street_direction
|
|
||||||
street_name
|
|
||||||
street_number
|
|
||||||
street_post_direction
|
|
||||||
street_suffix
|
|
||||||
unit
|
|
||||||
coordinate {
|
|
||||||
lon
|
|
||||||
lat
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
list_price
|
neighborhoods {
|
||||||
price_per_sqft
|
name
|
||||||
source {
|
|
||||||
id
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}"""
|
}
|
||||||
% self.listing_type.value.lower()
|
}"""
|
||||||
|
|
||||||
|
date_param = (
|
||||||
|
'sold_date: { min: "$today-%sD" }' % self.last_x_days
|
||||||
|
if self.listing_type == ListingType.SOLD and self.last_x_days
|
||||||
|
else (
|
||||||
|
'list_date: { min: "$today-%sD" }' % self.last_x_days
|
||||||
|
if self.last_x_days
|
||||||
|
else ""
|
||||||
|
)
|
||||||
)
|
)
|
||||||
|
|
||||||
|
sort_param = (
|
||||||
|
"sort: [{ field: sold_date, direction: desc }]"
|
||||||
|
if self.listing_type == ListingType.SOLD
|
||||||
|
else "sort: [{ field: list_date, direction: desc }]"
|
||||||
|
)
|
||||||
|
|
||||||
|
pending_or_contingent_param = "or_filters: { contingent: true, pending: true }" if self.pending_or_contingent else ""
|
||||||
|
|
||||||
|
if search_type == "comps": #: comps search, came from an address
|
||||||
|
query = """query Property_search(
|
||||||
|
$coordinates: [Float]!
|
||||||
|
$radius: String!
|
||||||
|
$offset: Int!,
|
||||||
|
) {
|
||||||
|
property_search(
|
||||||
|
query: {
|
||||||
|
nearby: {
|
||||||
|
coordinates: $coordinates
|
||||||
|
radius: $radius
|
||||||
|
}
|
||||||
|
status: %s
|
||||||
|
%s
|
||||||
|
}
|
||||||
|
%s
|
||||||
|
limit: 200
|
||||||
|
offset: $offset
|
||||||
|
) %s""" % (
|
||||||
|
self.listing_type.value.lower(),
|
||||||
|
date_param,
|
||||||
|
sort_param,
|
||||||
|
results_query,
|
||||||
|
)
|
||||||
|
elif search_type == "area": #: general search, came from a general location
|
||||||
|
query = """query Home_search(
|
||||||
|
$city: String,
|
||||||
|
$county: [String],
|
||||||
|
$state_code: String,
|
||||||
|
$postal_code: String
|
||||||
|
$offset: Int,
|
||||||
|
) {
|
||||||
|
home_search(
|
||||||
|
query: {
|
||||||
|
city: $city
|
||||||
|
county: $county
|
||||||
|
postal_code: $postal_code
|
||||||
|
state_code: $state_code
|
||||||
|
status: %s
|
||||||
|
%s
|
||||||
|
%s
|
||||||
|
}
|
||||||
|
%s
|
||||||
|
limit: 200
|
||||||
|
offset: $offset
|
||||||
|
) %s""" % (
|
||||||
|
self.listing_type.value.lower(),
|
||||||
|
date_param,
|
||||||
|
pending_or_contingent_param,
|
||||||
|
sort_param,
|
||||||
|
results_query,
|
||||||
|
)
|
||||||
|
else: #: general search, came from an address
|
||||||
|
query = (
|
||||||
|
"""query Property_search(
|
||||||
|
$property_id: [ID]!
|
||||||
|
$offset: Int!,
|
||||||
|
) {
|
||||||
|
property_search(
|
||||||
|
query: {
|
||||||
|
property_id: $property_id
|
||||||
|
}
|
||||||
|
limit: 1
|
||||||
|
offset: $offset
|
||||||
|
) %s""" % results_query)
|
||||||
|
|
||||||
payload = {
|
payload = {
|
||||||
"query": query,
|
"query": query,
|
||||||
"variables": variables,
|
"variables": variables,
|
||||||
}
|
}
|
||||||
|
|
||||||
response = self.session.post(self.search_url, json=payload)
|
response = self.session.post(self.SEARCH_GQL_URL, json=payload)
|
||||||
response.raise_for_status()
|
response.raise_for_status()
|
||||||
response_json = response.json()
|
response_json = response.json()
|
||||||
|
search_key = "home_search" if search_type == "area" else "property_search"
|
||||||
if return_total:
|
|
||||||
return response_json["data"]["home_search"]["total"]
|
|
||||||
|
|
||||||
properties: list[Property] = []
|
properties: list[Property] = []
|
||||||
|
|
||||||
|
@ -242,89 +412,164 @@ class RealtorScraper(Scraper):
|
||||||
response_json is None
|
response_json is None
|
||||||
or "data" not in response_json
|
or "data" not in response_json
|
||||||
or response_json["data"] is None
|
or response_json["data"] is None
|
||||||
or "home_search" not in response_json["data"]
|
or search_key not in response_json["data"]
|
||||||
or response_json["data"]["home_search"] is None
|
or response_json["data"][search_key] is None
|
||||||
or "results" not in response_json["data"]["home_search"]
|
or "results" not in response_json["data"][search_key]
|
||||||
):
|
):
|
||||||
return []
|
return {"total": 0, "properties": []}
|
||||||
|
|
||||||
for result in response_json["data"]["home_search"]["results"]:
|
for result in response_json["data"][search_key]["results"]:
|
||||||
self.counter += 1
|
self.counter += 1
|
||||||
address_one, _ = parse_address_one(result["location"]["address"]["line"])
|
mls = (
|
||||||
|
result["source"].get("id")
|
||||||
|
if "source" in result and isinstance(result["source"], dict)
|
||||||
|
else None
|
||||||
|
)
|
||||||
|
|
||||||
|
if not mls and self.mls_only:
|
||||||
|
continue
|
||||||
|
|
||||||
|
able_to_get_lat_long = (
|
||||||
|
result
|
||||||
|
and result.get("location")
|
||||||
|
and result["location"].get("address")
|
||||||
|
and result["location"]["address"].get("coordinate")
|
||||||
|
)
|
||||||
|
|
||||||
realty_property = Property(
|
realty_property = Property(
|
||||||
address=Address(
|
mls=mls,
|
||||||
address_one=address_one,
|
mls_id=result["source"].get("listing_id")
|
||||||
city=result["location"]["address"]["city"],
|
if "source" in result and isinstance(result["source"], dict)
|
||||||
state=result["location"]["address"]["state_code"],
|
|
||||||
zip_code=result["location"]["address"]["postal_code"],
|
|
||||||
address_two=parse_address_two(result["location"]["address"]["unit"]),
|
|
||||||
),
|
|
||||||
latitude=result["location"]["address"]["coordinate"]["lat"]
|
|
||||||
if result
|
|
||||||
and result.get("location")
|
|
||||||
and result["location"].get("address")
|
|
||||||
and result["location"]["address"].get("coordinate")
|
|
||||||
and "lat" in result["location"]["address"]["coordinate"]
|
|
||||||
else None,
|
else None,
|
||||||
longitude=result["location"]["address"]["coordinate"]["lon"]
|
property_url=f"{self.PROPERTY_URL}{result['property_id']}",
|
||||||
if result
|
status=result["status"].upper(),
|
||||||
and result.get("location")
|
list_price=result["list_price"],
|
||||||
and result["location"].get("address")
|
list_date=result["list_date"].split("T")[0]
|
||||||
and result["location"]["address"].get("coordinate")
|
if result.get("list_date")
|
||||||
and "lon" in result["location"]["address"]["coordinate"]
|
|
||||||
else None,
|
else None,
|
||||||
site_name=self.site_name,
|
prc_sqft=result.get("price_per_sqft"),
|
||||||
property_url="https://www.realtor.com/realestateandhomes-detail/" + result["property_id"],
|
last_sold_date=result.get("last_sold_date"),
|
||||||
stories=result["description"]["stories"],
|
hoa_fee=result["hoa"]["fee"]
|
||||||
year_built=result["description"]["year_built"],
|
if result.get("hoa") and isinstance(result["hoa"], dict)
|
||||||
price_per_sqft=result["price_per_sqft"],
|
else None,
|
||||||
mls_id=result["property_id"],
|
latitude=result["location"]["address"]["coordinate"].get("lat")
|
||||||
listing_type=self.listing_type,
|
if able_to_get_lat_long
|
||||||
lot_area_value=result["description"]["lot_sqft"],
|
else None,
|
||||||
beds_min=result["description"]["beds"],
|
longitude=result["location"]["address"]["coordinate"].get("lon")
|
||||||
beds_max=result["description"]["beds"],
|
if able_to_get_lat_long
|
||||||
baths_min=result["description"]["baths"],
|
else None,
|
||||||
baths_max=result["description"]["baths"],
|
address=self._parse_address(result, search_type="general_search"),
|
||||||
sqft_min=result["description"]["sqft"],
|
#: neighborhoods=self._parse_neighborhoods(result),
|
||||||
sqft_max=result["description"]["sqft"],
|
description=self._parse_description(result),
|
||||||
price_min=result["list_price"],
|
|
||||||
price_max=result["list_price"],
|
|
||||||
)
|
)
|
||||||
properties.append(realty_property)
|
properties.append(realty_property)
|
||||||
|
|
||||||
return properties
|
return {
|
||||||
|
"total": response_json["data"][search_key]["total"],
|
||||||
|
"properties": properties,
|
||||||
|
}
|
||||||
|
|
||||||
def search(self):
|
def search(self):
|
||||||
location_info = self.handle_location()
|
location_info = self.handle_location()
|
||||||
location_type = location_info["area_type"]
|
location_type = location_info["area_type"]
|
||||||
|
|
||||||
if location_type == "address":
|
|
||||||
property_id = location_info["mpr_id"]
|
|
||||||
return self.handle_address(property_id)
|
|
||||||
|
|
||||||
offset = 0
|
|
||||||
search_variables = {
|
search_variables = {
|
||||||
"city": location_info.get("city"),
|
"offset": 0,
|
||||||
"county": location_info.get("county"),
|
|
||||||
"state_code": location_info.get("state_code"),
|
|
||||||
"postal_code": location_info.get("postal_code"),
|
|
||||||
"offset": offset,
|
|
||||||
}
|
}
|
||||||
|
|
||||||
total = self.handle_area(search_variables, return_total=True)
|
search_type = "comps" if self.radius and location_type == "address" else "address" if location_type == "address" and not self.radius else "area"
|
||||||
|
if location_type == "address":
|
||||||
|
if not self.radius: #: single address search, non comps
|
||||||
|
property_id = location_info["mpr_id"]
|
||||||
|
search_variables |= {"property_id": property_id}
|
||||||
|
|
||||||
|
gql_results = self.general_search(search_variables, search_type=search_type)
|
||||||
|
if gql_results["total"] == 0:
|
||||||
|
listing_id = self.get_latest_listing_id(property_id)
|
||||||
|
if listing_id is None:
|
||||||
|
return self.handle_address(property_id)
|
||||||
|
else:
|
||||||
|
return self.handle_listing(listing_id)
|
||||||
|
else:
|
||||||
|
return gql_results["properties"]
|
||||||
|
|
||||||
|
else: #: general search, comps (radius)
|
||||||
|
coordinates = list(location_info["centroid"].values())
|
||||||
|
search_variables |= {
|
||||||
|
"coordinates": coordinates,
|
||||||
|
"radius": "{}mi".format(self.radius),
|
||||||
|
}
|
||||||
|
|
||||||
|
else: #: general search, location
|
||||||
|
search_variables |= {
|
||||||
|
"city": location_info.get("city"),
|
||||||
|
"county": location_info.get("county"),
|
||||||
|
"state_code": location_info.get("state_code"),
|
||||||
|
"postal_code": location_info.get("postal_code"),
|
||||||
|
}
|
||||||
|
|
||||||
|
result = self.general_search(search_variables, search_type=search_type)
|
||||||
|
total = result["total"]
|
||||||
|
homes = result["properties"]
|
||||||
|
|
||||||
homes = []
|
|
||||||
with ThreadPoolExecutor(max_workers=10) as executor:
|
with ThreadPoolExecutor(max_workers=10) as executor:
|
||||||
futures = [
|
futures = [
|
||||||
executor.submit(
|
executor.submit(
|
||||||
self.handle_area,
|
self.general_search,
|
||||||
variables=search_variables | {"offset": i},
|
variables=search_variables | {"offset": i},
|
||||||
return_total=False,
|
search_type=search_type,
|
||||||
)
|
)
|
||||||
for i in range(0, total, 200)
|
for i in range(200, min(total, 10000), 200)
|
||||||
]
|
]
|
||||||
|
|
||||||
for future in as_completed(futures):
|
for future in as_completed(futures):
|
||||||
homes.extend(future.result())
|
homes.extend(future.result()["properties"])
|
||||||
|
|
||||||
return homes
|
return homes
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def _parse_neighborhoods(result: dict) -> Optional[str]:
|
||||||
|
neighborhoods_list = []
|
||||||
|
neighborhoods = result["location"].get("neighborhoods", [])
|
||||||
|
|
||||||
|
if neighborhoods:
|
||||||
|
for neighborhood in neighborhoods:
|
||||||
|
name = neighborhood.get("name")
|
||||||
|
if name:
|
||||||
|
neighborhoods_list.append(name)
|
||||||
|
|
||||||
|
return ", ".join(neighborhoods_list) if neighborhoods_list else None
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def _parse_address(result: dict, search_type):
|
||||||
|
if search_type == "general_search":
|
||||||
|
return Address(
|
||||||
|
street=f"{result['location']['address']['street_number']} {result['location']['address']['street_name']} {result['location']['address']['street_suffix']}",
|
||||||
|
unit=result["location"]["address"]["unit"],
|
||||||
|
city=result["location"]["address"]["city"],
|
||||||
|
state=result["location"]["address"]["state_code"],
|
||||||
|
zip=result["location"]["address"]["postal_code"],
|
||||||
|
)
|
||||||
|
return Address(
|
||||||
|
street=f"{result['address']['street_number']} {result['address']['street_name']} {result['address']['street_suffix']}",
|
||||||
|
unit=result["address"]["unit"],
|
||||||
|
city=result["address"]["city"],
|
||||||
|
state=result["address"]["state_code"],
|
||||||
|
zip=result["address"]["postal_code"],
|
||||||
|
)
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def _parse_description(result: dict) -> Description:
|
||||||
|
description_data = result.get("description", {})
|
||||||
|
return Description(
|
||||||
|
style=description_data.get("type", "").upper(),
|
||||||
|
beds=description_data.get("beds"),
|
||||||
|
baths_full=description_data.get("baths_full"),
|
||||||
|
baths_half=description_data.get("baths_half"),
|
||||||
|
sqft=description_data.get("sqft"),
|
||||||
|
lot_sqft=description_data.get("lot_sqft"),
|
||||||
|
sold_price=description_data.get("sold_price"),
|
||||||
|
year_built=description_data.get("year_built"),
|
||||||
|
garage=description_data.get("garage"),
|
||||||
|
stories=description_data.get("stories"),
|
||||||
|
)
|
||||||
|
|
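Two pieces of control flow in the new `search` method are easy to misread in diff form: the nested conditional that picks `search_type`, and the offsets used for the follow-up pages after the first `general_search` call. The sketch below restates both as standalone functions. The names `choose_search_type` and `page_offsets` are hypothetical and do not appear in the commit; the page size of 200 and the cap of 10000 are taken from the diff.

```python
from __future__ import annotations


def choose_search_type(location_type: str, radius: float | None) -> str:
    """Hypothetical restatement of the search_type conditional in search()."""
    if location_type == "address":
        # An address with a radius is a comps search; without one it is a
        # single-address lookup.
        return "comps" if radius else "address"
    return "area"


def page_offsets(total: int, page_size: int = 200, cap: int = 10000) -> list[int]:
    """Offsets for the pages fetched after the first one.

    The first page (offset 0) is already fetched by the initial
    general_search call, so iteration starts at page_size.
    """
    return list(range(page_size, min(total, cap), page_size))


offsets_small = page_offsets(450)
offsets_capped = page_offsets(50000)
```

Starting the range at 200 rather than 0 avoids fetching the first page twice, and the `min(total, 10000)` bound limits a single search to at most 50 requests in total.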
|
@ -1,246 +0,0 @@
|
||||||
"""
|
|
||||||
homeharvest.redfin.__init__
|
|
||||||
~~~~~~~~~~~~
|
|
||||||
|
|
||||||
This module implements the scraper for redfin.com
|
|
||||||
"""
|
|
||||||
import json
|
|
||||||
from typing import Any
|
|
||||||
from .. import Scraper
|
|
||||||
from ....utils import parse_address_two, parse_address_one
|
|
||||||
from ..models import Property, Address, PropertyType, ListingType, SiteName, Agent
|
|
||||||
from ....exceptions import NoResultsFound, SearchTooBroad
|
|
||||||
from datetime import datetime
|
|
||||||
|
|
||||||
|
|
||||||
class RedfinScraper(Scraper):
|
|
||||||
def __init__(self, scraper_input):
|
|
||||||
super().__init__(scraper_input)
|
|
||||||
self.listing_type = scraper_input.listing_type
|
|
||||||
|
|
||||||
def _handle_location(self):
|
|
||||||
url = "https://www.redfin.com/stingray/do/location-autocomplete?v=2&al=1&location={}".format(self.location)
|
|
||||||
|
|
||||||
response = self.session.get(url)
|
|
||||||
response_json = json.loads(response.text.replace("{}&&", ""))
|
|
||||||
|
|
||||||
def get_region_type(match_type: str):
|
|
||||||
if match_type == "4":
|
|
||||||
return "2" #: zip
|
|
||||||
elif match_type == "2":
|
|
||||||
return "6" #: city
|
|
||||||
elif match_type == "1":
|
|
||||||
return "address" #: address, needs to be handled differently
|
|
||||||
elif match_type == "11":
|
|
||||||
return "state"
|
|
||||||
|
|
||||||
if "exactMatch" not in response_json["payload"]:
|
|
||||||
raise NoResultsFound("No results found for location: {}".format(self.location))
|
|
||||||
|
|
||||||
if response_json["payload"]["exactMatch"] is not None:
|
|
||||||
target = response_json["payload"]["exactMatch"]
|
|
||||||
else:
|
|
||||||
target = response_json["payload"]["sections"][0]["rows"][0]
|
|
||||||
|
|
||||||
return target["id"].split("_")[1], get_region_type(target["type"])
|
|
||||||
|
|
||||||
def _parse_home(self, home: dict, single_search: bool = False) -> Property:
|
|
||||||
def get_value(key: str) -> Any | None:
|
|
||||||
if key in home and "value" in home[key]:
|
|
||||||
return home[key]["value"]
|
|
||||||
|
|
||||||
if not single_search:
|
|
||||||
address = Address(
|
|
||||||
address_one=parse_address_one(get_value("streetLine"))[0],
|
|
||||||
address_two=parse_address_one(get_value("streetLine"))[1],
|
|
||||||
city=home.get("city"),
|
|
||||||
state=home.get("state"),
|
|
||||||
zip_code=home.get("zip"),
|
|
||||||
)
|
|
||||||
else:
|
|
||||||
address_info = home.get("streetAddress")
|
|
||||||
address_one, address_two = parse_address_one(address_info.get("assembledAddress"))
|
|
||||||
|
|
||||||
address = Address(
|
|
||||||
address_one=address_one,
|
|
||||||
address_two=address_two,
|
|
||||||
city=home.get("city"),
|
|
||||||
state=home.get("state"),
|
|
||||||
zip_code=home.get("zip"),
|
|
||||||
)
|
|
||||||
|
|
||||||
url = "https://www.redfin.com{}".format(home["url"])
|
|
||||||
lot_size_data = home.get("lotSize")
|
|
||||||
|
|
||||||
if not isinstance(lot_size_data, int):
|
|
||||||
lot_size = lot_size_data.get("value", None) if isinstance(lot_size_data, dict) else None
|
|
||||||
else:
|
|
||||||
lot_size = lot_size_data
|
|
||||||
|
|
||||||
lat_long = get_value("latLong")
|
|
||||||
|
|
||||||
return Property(
|
|
||||||
site_name=self.site_name,
|
|
||||||
listing_type=self.listing_type,
|
|
||||||
address=address,
|
|
||||||
property_url=url,
|
|
||||||
beds_min=home["beds"] if "beds" in home else None,
|
|
||||||
beds_max=home["beds"] if "beds" in home else None,
|
|
||||||
baths_min=home["baths"] if "baths" in home else None,
|
|
||||||
baths_max=home["baths"] if "baths" in home else None,
|
|
||||||
price_min=get_value("price"),
|
|
||||||
price_max=get_value("price"),
|
|
||||||
sqft_min=get_value("sqFt"),
|
|
||||||
sqft_max=get_value("sqFt"),
|
|
||||||
stories=home["stories"] if "stories" in home else None,
|
|
||||||
agent=Agent( #: listingAgent, some have sellingAgent as well
|
|
||||||
name=home['listingAgent'].get('name') if 'listingAgent' in home else None,
|
|
||||||
phone=home['listingAgent'].get('phone') if 'listingAgent' in home else None,
|
|
||||||
),
|
|
||||||
description=home["listingRemarks"] if "listingRemarks" in home else None,
|
|
||||||
year_built=get_value("yearBuilt") if not single_search else home.get("yearBuilt"),
|
|
||||||
lot_area_value=lot_size,
|
|
||||||
property_type=PropertyType.from_int_code(home.get("propertyType")),
|
|
||||||
price_per_sqft=get_value("pricePerSqFt") if type(home.get("pricePerSqFt")) != int else home.get("pricePerSqFt"),
|
|
||||||
mls_id=get_value("mlsId"),
|
|
||||||
latitude=lat_long.get('latitude') if lat_long else None,
|
|
||||||
longitude=lat_long.get('longitude') if lat_long else None,
|
|
||||||
sold_date=datetime.fromtimestamp(home['soldDate'] / 1000) if 'soldDate' in home else None,
|
|
||||||
days_on_market=get_value("dom")
|
|
||||||
)
|
|
||||||
|
|
||||||
def _handle_rentals(self, region_id, region_type):
|
|
||||||
url = f"https://www.redfin.com/stingray/api/v1/search/rentals?al=1&isRentals=true&region_id={region_id}&region_type={region_type}&num_homes=100000"
|
|
||||||
|
|
||||||
response = self.session.get(url)
|
|
||||||
response.raise_for_status()
|
|
||||||
homes = response.json()
|
|
||||||
|
|
||||||
properties_list = []
|
|
||||||
|
|
||||||
for home in homes["homes"]:
|
|
||||||
home_data = home["homeData"]
|
|
||||||
rental_data = home["rentalExtension"]
|
|
||||||
|
|
||||||
property_url = f"https://www.redfin.com{home_data.get('url', '')}"
|
|
||||||
address_info = home_data.get("addressInfo", {})
|
|
||||||
centroid = address_info.get("centroid", {}).get("centroid", {})
|
|
||||||
address = Address(
|
|
||||||
address_one=parse_address_one(address_info.get("formattedStreetLine"))[0],
|
|
||||||
city=address_info.get("city"),
|
|
||||||
state=address_info.get("state"),
|
|
||||||
zip_code=address_info.get("zip"),
|
|
||||||
)
|
|
||||||
|
|
||||||
price_range = rental_data.get("rentPriceRange", {"min": None, "max": None})
|
|
||||||
bed_range = rental_data.get("bedRange", {"min": None, "max": None})
|
|
||||||
bath_range = rental_data.get("bathRange", {"min": None, "max": None})
|
|
||||||
sqft_range = rental_data.get("sqftRange", {"min": None, "max": None})
|
|
||||||
|
|
||||||
property_ = Property(
|
|
||||||
property_url=property_url,
|
|
||||||
site_name=SiteName.REDFIN,
|
|
||||||
listing_type=ListingType.FOR_RENT,
|
|
||||||
address=address,
|
|
||||||
description=rental_data.get("description"),
|
|
||||||
latitude=centroid.get("latitude"),
|
|
||||||
longitude=centroid.get("longitude"),
|
|
||||||
baths_min=bath_range.get("min"),
|
|
||||||
baths_max=bath_range.get("max"),
|
|
||||||
beds_min=bed_range.get("min"),
|
|
||||||
beds_max=bed_range.get("max"),
|
|
||||||
price_min=price_range.get("min"),
|
|
||||||
price_max=price_range.get("max"),
|
|
||||||
sqft_min=sqft_range.get("min"),
|
|
||||||
sqft_max=sqft_range.get("max"),
|
|
||||||
img_src=home_data.get("staticMapUrl"),
|
|
||||||
posted_time=rental_data.get("lastUpdated"),
|
|
||||||
bldg_name=rental_data.get("propertyName"),
|
|
||||||
)
|
|
||||||
|
|
||||||
properties_list.append(property_)
|
|
||||||
|
|
||||||
if not properties_list:
|
|
||||||
raise NoResultsFound("No rentals found for the given location.")
|
|
||||||
|
|
||||||
return properties_list
|
|
||||||
|
|
||||||
def _parse_building(self, building: dict) -> Property:
|
|
||||||
street_address = " ".join(
|
|
||||||
[
|
|
||||||
building["address"]["streetNumber"],
|
|
||||||
building["address"]["directionalPrefix"],
|
|
||||||
building["address"]["streetName"],
|
|
||||||
building["address"]["streetType"],
|
|
||||||
]
|
|
||||||
)
|
|
||||||
return Property(
|
|
||||||
site_name=self.site_name,
|
|
||||||
property_type=PropertyType("BUILDING"),
|
|
||||||
address=Address(
|
|
||||||
address_one=parse_address_one(street_address)[0],
|
|
||||||
city=building["address"]["city"],
|
|
||||||
state=building["address"]["stateOrProvinceCode"],
|
|
||||||
zip_code=building["address"]["postalCode"],
|
|
||||||
address_two=parse_address_two(
|
|
||||||
" ".join(
|
|
||||||
[
|
|
||||||
building["address"]["unitType"],
|
|
||||||
building["address"]["unitValue"],
|
|
||||||
]
|
|
||||||
)
|
|
||||||
),
|
|
||||||
),
|
|
||||||
property_url="https://www.redfin.com{}".format(building["url"]),
|
|
||||||
listing_type=self.listing_type,
|
|
||||||
unit_count=building.get("numUnitsForSale"),
|
|
||||||
)
|
|
||||||
|
|
||||||
def handle_address(self, home_id: str):
|
|
||||||
"""
|
|
||||||
EPs:
|
|
||||||
https://www.redfin.com/stingray/api/home/details/initialInfo?al=1&path=/TX/Austin/70-Rainey-St-78701/unit-1608/home/147337694
|
|
||||||
https://www.redfin.com/stingray/api/home/details/mainHouseInfoPanelInfo?propertyId=147337694&accessLevel=3
|
|
||||||
https://www.redfin.com/stingray/api/home/details/aboveTheFold?propertyId=147337694&accessLevel=3
|
|
||||||
https://www.redfin.com/stingray/api/home/details/belowTheFold?propertyId=147337694&accessLevel=3
|
|
||||||
"""
|
|
||||||
url = "https://www.redfin.com/stingray/api/home/details/aboveTheFold?propertyId={}&accessLevel=3".format(
|
|
||||||
home_id
|
|
||||||
)
|
|
||||||
|
|
||||||
response = self.session.get(url)
|
|
||||||
response_json = json.loads(response.text.replace("{}&&", ""))
|
|
||||||
|
|
||||||
parsed_home = self._parse_home(response_json["payload"]["addressSectionInfo"], single_search=True)
|
|
||||||
return [parsed_home]
|
|
||||||
|
|
||||||
def search(self):
|
|
||||||
region_id, region_type = self._handle_location()
|
|
||||||
|
|
||||||
if region_type == "state":
|
|
||||||
raise SearchTooBroad("State searches are not supported, please use a more specific location.")
|
|
||||||
|
|
||||||
if region_type == "address":
|
|
||||||
home_id = region_id
|
|
||||||
return self.handle_address(home_id)
|
|
||||||
|
|
||||||
if self.listing_type == ListingType.FOR_RENT:
|
|
||||||
return self._handle_rentals(region_id, region_type)
|
|
||||||
else:
|
|
||||||
if self.listing_type == ListingType.FOR_SALE:
|
|
||||||
url = f"https://www.redfin.com/stingray/api/gis?al=1&region_id={region_id}&region_type={region_type}&num_homes=100000"
|
|
||||||
else:
|
|
||||||
url = f"https://www.redfin.com/stingray/api/gis?al=1&region_id={region_id}&region_type={region_type}&sold_within_days=30&num_homes=100000"
|
|
||||||
response = self.session.get(url)
|
|
||||||
response_json = json.loads(response.text.replace("{}&&", ""))
|
|
||||||
|
|
||||||
if "payload" in response_json:
|
|
||||||
homes_list = response_json["payload"].get("homes", [])
|
|
||||||
buildings_list = response_json["payload"].get("buildings", {}).values()
|
|
||||||
|
|
||||||
homes = [self._parse_home(home) for home in homes_list] + [
|
|
||||||
self._parse_building(building) for building in buildings_list
|
|
||||||
]
|
|
||||||
return homes
|
|
||||||
else:
|
|
||||||
return []
|
|
|
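The removed Redfin scraper strips a `{}&&` guard from each response body before handing it to `json.loads`, then reads `payload.homes` if a payload is present. A minimal sketch of that step follows. The function names and the sample response are hypothetical, and the prefix is removed only at the start of the text here, whereas the original calls `str.replace` on the whole body.

```python
import json

# Guard prefix the removed scraper strips before parsing.
GUARD_PREFIX = "{}&&"


def parse_guarded_json(text: str) -> dict:
    """Drop a leading '{}&&' guard, if present, and parse the remainder as JSON."""
    if text.startswith(GUARD_PREFIX):
        text = text[len(GUARD_PREFIX):]
    return json.loads(text)


def extract_homes(text: str) -> list:
    """Return the 'homes' list from the payload, or an empty list if there is no payload."""
    payload = parse_guarded_json(text).get("payload")
    if payload is None:
        return []
    return payload.get("homes", [])


# Hypothetical response body, shaped like the fields the scraper reads.
sample = '{}&&{"payload": {"homes": [{"mlsId": "A1"}], "buildings": {}}}'
homes = extract_homes(sample)
```

Returning an empty list when `payload` is missing mirrors the `else: return []` branch of the original `search` method.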
@ -1,335 +0,0 @@
|
||||||
"""
|
|
||||||
homeharvest.zillow.__init__
|
|
||||||
~~~~~~~~~~~~
|
|
||||||
|
|
||||||
This module implements the scraper for zillow.com
|
|
||||||
"""
|
|
||||||
import re
|
|
||||||
import json
|
|
||||||
|
|
||||||
import tls_client
|
|
||||||
|
|
||||||
from .. import Scraper
|
|
||||||
from requests.exceptions import HTTPError
|
|
||||||
from ....utils import parse_address_one, parse_address_two
|
|
||||||
from ....exceptions import GeoCoordsNotFound, NoResultsFound
|
|
||||||
from ..models import Property, Address, ListingType, PropertyType, Agent
|
|
||||||
import urllib.parse
|
|
||||||
from datetime import datetime, timedelta
|
|
||||||
|
|
||||||
|
|
||||||
class ZillowScraper(Scraper):
|
|
||||||
def __init__(self, scraper_input):
|
|
||||||
session = tls_client.Session(
|
|
||||||
client_identifier="chrome112", random_tls_extension_order=True
|
|
||||||
)
|
|
||||||
|
|
||||||
super().__init__(scraper_input, session)
|
|
||||||
|
|
||||||
self.session.headers.update({
|
|
||||||
'authority': 'www.zillow.com',
|
|
||||||
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
|
|
||||||
'accept-language': 'en-US,en;q=0.9',
|
|
||||||
'cache-control': 'max-age=0',
|
|
||||||
'sec-fetch-dest': 'document',
|
|
||||||
'sec-fetch-mode': 'navigate',
|
|
||||||
'sec-fetch-site': 'same-origin',
|
|
||||||
'sec-fetch-user': '?1',
|
|
||||||
'upgrade-insecure-requests': '1',
|
|
||||||
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
|
|
||||||
})
|
|
||||||
|
|
||||||
if not self.is_plausible_location(self.location):
|
|
||||||
raise NoResultsFound("Invalid location input: {}".format(self.location))
|
|
||||||
|
|
||||||
listing_type_to_url_path = {
|
|
||||||
ListingType.FOR_SALE: "for_sale",
|
|
||||||
ListingType.FOR_RENT: "for_rent",
|
|
||||||
ListingType.SOLD: "recently_sold",
|
|
||||||
}
|
|
||||||
|
|
||||||
self.url = f"https://www.zillow.com/homes/{listing_type_to_url_path[self.listing_type]}/{self.location}_rb/"
|
|
||||||
|
|
||||||
def is_plausible_location(self, location: str) -> bool:
|
|
||||||
url = (
|
|
||||||
"https://www.zillowstatic.com/autocomplete/v3/suggestions?q={"
|
|
||||||
"}&abKey=6666272a-4b99-474c-b857-110ec438732b&clientId=homepage-render"
|
|
||||||
).format(urllib.parse.quote(location))
|
|
||||||
|
|
||||||
resp = self.session.get(url)
|
|
||||||
|
|
||||||
return resp.json()["results"] != []
|
|
||||||
|
|
||||||
def search(self):
|
|
||||||
resp = self.session.get(self.url)
|
|
||||||
if resp.status_code != 200:
|
|
||||||
raise HTTPError(
|
|
||||||
f"bad response status code: {resp.status_code}"
|
|
||||||
)
|
|
||||||
content = resp.text
|
|
||||||
|
|
||||||
match = re.search(
|
|
||||||
r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
|
|
||||||
content,
|
|
||||||
re.DOTALL,
|
|
||||||
)
|
|
||||||
if not match:
|
|
||||||
raise NoResultsFound("No results were found for Zillow with the given Location.")
|
|
||||||
|
|
||||||
json_str = match.group(1)
|
|
||||||
data = json.loads(json_str)
|
|
||||||
|
|
||||||
if "searchPageState" in data["props"]["pageProps"]:
|
|
||||||
pattern = r'window\.mapBounds = \{\s*"west":\s*(-?\d+\.\d+),\s*"east":\s*(-?\d+\.\d+),\s*"south":\s*(-?\d+\.\d+),\s*"north":\s*(-?\d+\.\d+)\s*\};'
|
|
||||||
|
|
||||||
match = re.search(pattern, content)
|
|
||||||
|
|
||||||
if match:
|
|
||||||
coords = [float(coord) for coord in match.groups()]
|
|
||||||
return self._fetch_properties_backend(coords)
|
|
||||||
|
|
||||||
else:
|
|
||||||
raise GeoCoordsNotFound("Box bounds could not be located.")
|
|
||||||
|
|
||||||
elif "gdpClientCache" in data["props"]["pageProps"]:
|
|
||||||
gdp_client_cache = json.loads(data["props"]["pageProps"]["gdpClientCache"])
|
|
||||||
main_key = list(gdp_client_cache.keys())[0]
|
|
||||||
|
|
||||||
property_data = gdp_client_cache[main_key]["property"]
|
|
||||||
property = self._get_single_property_page(property_data)
|
|
||||||
|
|
||||||
return [property]
|
|
||||||
raise NoResultsFound("Specific property data not found in the response.")
|
|
||||||
|
|
||||||
def _fetch_properties_backend(self, coords):
|
|
||||||
url = "https://www.zillow.com/async-create-search-page-state"
|
|
||||||
|
|
||||||
filter_state_for_sale = {
|
|
||||||
"sortSelection": {
|
|
||||||
# "value": "globalrelevanceex"
|
|
||||||
"value": "days"
|
|
||||||
},
|
|
||||||
"isAllHomes": {"value": True},
|
|
||||||
}
|
|
||||||
|
|
||||||
filter_state_for_rent = {
|
|
||||||
"isForRent": {"value": True},
|
|
||||||
"isForSaleByAgent": {"value": False},
|
|
||||||
"isForSaleByOwner": {"value": False},
|
|
||||||
"isNewConstruction": {"value": False},
|
|
||||||
"isComingSoon": {"value": False},
|
|
||||||
"isAuction": {"value": False},
|
|
||||||
"isForSaleForeclosure": {"value": False},
|
|
||||||
"isAllHomes": {"value": True},
|
|
||||||
}
|
|
||||||
|
|
||||||
filter_state_sold = {
|
|
||||||
"isRecentlySold": {"value": True},
|
|
||||||
"isForSaleByAgent": {"value": False},
|
|
||||||
"isForSaleByOwner": {"value": False},
|
|
||||||
"isNewConstruction": {"value": False},
|
|
||||||
"isComingSoon": {"value": False},
|
|
||||||
"isAuction": {"value": False},
|
|
||||||
"isForSaleForeclosure": {"value": False},
|
|
||||||
"isAllHomes": {"value": True},
|
|
||||||
}
|
|
||||||
|
|
||||||
selected_filter = (
|
|
||||||
filter_state_for_rent
|
|
||||||
if self.listing_type == ListingType.FOR_RENT
|
|
||||||
else filter_state_for_sale
|
|
||||||
if self.listing_type == ListingType.FOR_SALE
|
|
||||||
else filter_state_sold
|
|
||||||
)
|
|
||||||
|
|
||||||
payload = {
|
|
||||||
"searchQueryState": {
|
|
||||||
"pagination": {},
|
|
||||||
"isMapVisible": True,
|
|
||||||
"mapBounds": {
|
|
||||||
"west": coords[0],
|
|
||||||
"east": coords[1],
|
|
||||||
"south": coords[2],
|
|
||||||
"north": coords[3],
|
|
||||||
},
|
|
||||||
"filterState": selected_filter,
|
|
||||||
"isListVisible": True,
|
|
||||||
"mapZoom": 11,
|
|
||||||
},
|
|
||||||
"wants": {"cat1": ["mapResults"]},
|
|
||||||
"isDebugRequest": False,
|
|
||||||
}
|
|
||||||
resp = self.session.put(url, json=payload)
|
|
||||||
if resp.status_code != 200:
|
|
||||||
raise HTTPError(
|
|
||||||
f"bad response status code: {resp.status_code}"
|
|
||||||
)
|
|
||||||
return self._parse_properties(resp.json())
|
|
||||||
|
|
||||||
@staticmethod
|
|
||||||
def parse_posted_time(time: str) -> datetime:
|
|
||||||
int_time = int(time.split(" ")[0])
|
|
||||||
|
|
||||||
if "hour" in time:
|
|
||||||
return datetime.now() - timedelta(hours=int_time)
|
|
||||||
|
|
||||||
if "day" in time:
|
|
||||||
return datetime.now() - timedelta(days=int_time)
|
|
||||||
|
|
||||||
def _parse_properties(self, property_data: dict):
|
|
||||||
mapresults = property_data["cat1"]["searchResults"]["mapResults"]
|
|
||||||
|
|
||||||
properties_list = []
|
|
||||||
|
|
||||||
for result in mapresults:
|
|
||||||
if "hdpData" in result:
|
|
||||||
home_info = result["hdpData"]["homeInfo"]
|
|
||||||
address_data = {
|
|
||||||
"address_one": parse_address_one(home_info.get("streetAddress"))[0],
|
|
||||||
"address_two": parse_address_two(home_info["unit"]) if "unit" in home_info else "#",
|
|
||||||
"city": home_info.get("city"),
|
|
||||||
"state": home_info.get("state"),
|
|
||||||
"zip_code": home_info.get("zipcode"),
|
|
||||||
}
|
|
||||||
property_obj = Property(
|
|
||||||
site_name=self.site_name,
|
|
||||||
address=Address(**address_data),
|
|
||||||
property_url=f"https://www.zillow.com{result['detailUrl']}",
|
|
||||||
tax_assessed_value=int(home_info["taxAssessedValue"]) if "taxAssessedValue" in home_info else None,
|
|
||||||
property_type=PropertyType(home_info.get("homeType")),
|
|
||||||
listing_type=ListingType(
|
|
||||||
home_info["statusType"] if "statusType" in home_info else self.listing_type
|
|
||||||
),
|
|
||||||
status_text=result.get("statusText"),
|
|
||||||
posted_time=self.parse_posted_time(result["variableData"]["text"])
|
|
||||||
if "variableData" in result
|
|
||||||
and "text" in result["variableData"]
|
|
||||||
and result["variableData"]["type"] == "TIME_ON_INFO"
|
|
||||||
else None,
|
|
||||||
price_min=home_info.get("price"),
|
|
||||||
price_max=home_info.get("price"),
|
|
||||||
beds_min=int(home_info["bedrooms"]) if "bedrooms" in home_info else None,
|
|
||||||
beds_max=int(home_info["bedrooms"]) if "bedrooms" in home_info else None,
|
|
||||||
baths_min=home_info.get("bathrooms"),
|
|
||||||
baths_max=home_info.get("bathrooms"),
|
|
||||||
sqft_min=int(home_info["livingArea"]) if "livingArea" in home_info else None,
|
|
||||||
sqft_max=int(home_info["livingArea"]) if "livingArea" in home_info else None,
|
|
||||||
price_per_sqft=int(home_info["price"] // home_info["livingArea"])
|
|
||||||
if "livingArea" in home_info and home_info["livingArea"] != 0 and "price" in home_info
|
|
||||||
else None,
|
|
||||||
latitude=result["latLong"]["latitude"],
|
|
||||||
longitude=result["latLong"]["longitude"],
|
|
||||||
lot_area_value=round(home_info["lotAreaValue"], 2) if "lotAreaValue" in home_info else None,
|
|
||||||
lot_area_unit=home_info.get("lotAreaUnit"),
|
|
||||||
img_src=result.get("imgSrc"),
|
|
||||||
)
|
|
||||||
|
|
||||||
properties_list.append(property_obj)
|
|
||||||
|
|
||||||
elif "isBuilding" in result:
|
|
||||||
price_string = result["price"].replace("$", "").replace(",", "").replace("+/mo", "")
|
|
||||||
|
|
||||||
match = re.search(r"(\d+)", price_string)
|
|
||||||
price_value = int(match.group(1)) if match else None
|
|
||||||
building_obj = Property(
|
|
||||||
property_url=f"https://www.zillow.com{result['detailUrl']}",
|
|
||||||
site_name=self.site_name,
|
|
||||||
property_type=PropertyType("BUILDING"),
|
|
||||||
listing_type=ListingType(result["statusType"]),
|
|
||||||
img_src=result.get("imgSrc"),
|
|
||||||
address=self._extract_address(result["address"]),
|
|
||||||
baths_min=result.get("minBaths"),
|
|
||||||
area_min=result.get("minArea"),
|
|
||||||
bldg_name=result.get("communityName"),
|
|
||||||
status_text=result.get("statusText"),
|
|
||||||
price_min=price_value if "+/mo" in result.get("price") else None,
|
|
||||||
price_max=price_value if "+/mo" in result.get("price") else None,
|
|
||||||
latitude=result.get("latLong", {}).get("latitude"),
|
|
||||||
longitude=result.get("latLong", {}).get("longitude"),
|
|
||||||
unit_count=result.get("unitCount"),
|
|
||||||
)
|
|
||||||
|
|
||||||
properties_list.append(building_obj)
|
|
||||||
|
|
||||||
return properties_list
|
|
||||||
|
|
||||||
def _get_single_property_page(self, property_data: dict):
|
|
||||||
"""
|
|
||||||
This method is used when a user enters the exact location & zillow returns just one property
|
|
||||||
"""
|
|
||||||
url = (
|
|
||||||
f"https://www.zillow.com{property_data['hdpUrl']}"
|
|
||||||
if "zillow.com" not in property_data["hdpUrl"]
|
|
||||||
else property_data["hdpUrl"]
|
|
||||||
)
|
|
||||||
address_data = property_data["address"]
|
|
||||||
address_one, address_two = parse_address_one(address_data["streetAddress"])
|
|
||||||
address = Address(
|
|
||||||
address_one=address_one,
|
|
||||||
address_two=address_two if address_two else "#",
|
|
||||||
city=address_data["city"],
|
|
||||||
state=address_data["state"],
|
|
||||||
zip_code=address_data["zipcode"],
|
|
||||||
)
|
|
||||||
property_type = property_data.get("homeType", None)
|
|
||||||
return Property(
|
|
||||||
site_name=self.site_name,
|
|
||||||
property_url=url,
|
|
||||||
property_type=PropertyType(property_type) if property_type in PropertyType.__members__ else None,
|
|
||||||
listing_type=self.listing_type,
|
|
||||||
address=address,
|
|
||||||
year_built=property_data.get("yearBuilt"),
|
|
||||||
tax_assessed_value=property_data.get("taxAssessedValue"),
|
|
||||||
lot_area_value=property_data.get("lotAreaValue"),
|
|
||||||
lot_area_unit=property_data["lotAreaUnits"].lower() if "lotAreaUnits" in property_data else None,
|
|
||||||
agent=Agent(
|
|
||||||
name=property_data.get("attributionInfo", {}).get("agentName")
|
|
||||||
),
|
|
||||||
stories=property_data.get("resoFacts", {}).get("stories"),
|
|
||||||
mls_id=property_data.get("attributionInfo", {}).get("mlsId"),
|
|
||||||
beds_min=property_data.get("bedrooms"),
|
|
||||||
beds_max=property_data.get("bedrooms"),
|
|
||||||
baths_min=property_data.get("bathrooms"),
|
|
||||||
baths_max=property_data.get("bathrooms"),
|
|
||||||
price_min=property_data.get("price"),
|
|
||||||
price_max=property_data.get("price"),
|
|
||||||
sqft_min=property_data.get("livingArea"),
|
|
||||||
sqft_max=property_data.get("livingArea"),
|
|
||||||
price_per_sqft=property_data.get("resoFacts", {}).get("pricePerSquareFoot"),
|
|
||||||
latitude=property_data.get("latitude"),
|
|
||||||
longitude=property_data.get("longitude"),
|
|
||||||
img_src=property_data.get("streetViewTileImageUrlMediumAddress"),
|
|
||||||
description=property_data.get("description"),
|
|
||||||
)
|
|
||||||
|
|
||||||
def _extract_address(self, address_str):
|
|
||||||
"""
|
|
||||||
Extract address components from a string formatted like '555 Wedglea Dr, Dallas, TX',
|
|
||||||
and return an Address object.
|
|
||||||
"""
|
|
||||||
parts = address_str.split(", ")
|
|
||||||
|
|
||||||
if len(parts) != 3:
|
|
||||||
raise ValueError(f"Unexpected address format: {address_str}")
|
|
||||||
|
|
||||||
address_one = parts[0].strip()
|
|
||||||
city = parts[1].strip()
|
|
||||||
state_zip = parts[2].split(" ")
|
|
||||||
|
|
||||||
if len(state_zip) == 1:
|
|
||||||
state = state_zip[0].strip()
|
|
||||||
zip_code = None
|
|
||||||
elif len(state_zip) == 2:
|
|
||||||
state = state_zip[0].strip()
|
|
||||||
zip_code = state_zip[1].strip()
|
|
||||||
else:
|
|
||||||
raise ValueError(f"Unexpected state/zip format in address: {address_str}")
|
|
||||||
|
|
||||||
address_one, address_two = parse_address_one(address_one)
|
|
||||||
return Address(
|
|
||||||
address_one=address_one,
|
|
||||||
address_two=address_two if address_two else "#",
|
|
||||||
city=city,
|
|
||||||
state=state,
|
|
||||||
zip_code=zip_code,
|
|
||||||
)
|
|
|
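The `_extract_address` helper in the removed Zillow scraper splits a string such as '555 Wedglea Dr, Dallas, TX' into street, city, state and an optional ZIP code. The sketch below isolates that splitting logic. `SimpleAddress` and `split_address` are hypothetical stand-ins for the project's `Address` model and method, and the unit parsing done by `parse_address_one` is left out.

```python
from typing import NamedTuple, Optional


class SimpleAddress(NamedTuple):
    """Hypothetical stand-in for the project's Address model."""
    street: str
    city: str
    state: str
    zip_code: Optional[str]


def split_address(address_str: str) -> SimpleAddress:
    """Split 'street, city, state [zip]' into its components."""
    parts = address_str.split(", ")
    if len(parts) != 3:
        raise ValueError(f"Unexpected address format: {address_str}")

    street, city, state_zip = (part.strip() for part in parts)
    tokens = state_zip.split()

    if len(tokens) == 1:
        # Only a state was given, e.g. 'TX'.
        state, zip_code = tokens[0], None
    elif len(tokens) == 2:
        # State followed by ZIP, e.g. 'TX 75211'.
        state, zip_code = tokens
    else:
        raise ValueError(f"Unexpected state/zip format in address: {address_str}")

    return SimpleAddress(street, city, state, zip_code)


without_zip = split_address("555 Wedglea Dr, Dallas, TX")
with_zip = split_address("555 Wedglea Dr, Dallas, TX 75211")
```

As in the original, anything other than exactly three comma-separated parts raises `ValueError` rather than guessing at the layout.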
@ -1,18 +1,6 @@
|
||||||
class InvalidSite(Exception):
|
|
||||||
"""Raised when a provided site does not exist."""
|
|
||||||
|
|
||||||
|
|
||||||
class InvalidListingType(Exception):
|
class InvalidListingType(Exception):
|
||||||
"""Raised when a provided listing type does not exist."""
|
"""Raised when a provided listing type does not exist."""
|
||||||
|
|
||||||
|
|
||||||
class NoResultsFound(Exception):
|
class NoResultsFound(Exception):
|
||||||
"""Raised when no results are found for the given location"""
|
"""Raised when no results are found for the given location"""
|
||||||
|
|
||||||
|
|
||||||
class GeoCoordsNotFound(Exception):
|
|
||||||
"""Raised when no property is found for the given address"""
|
|
||||||
|
|
||||||
|
|
||||||
class SearchTooBroad(Exception):
|
|
||||||
"""Raised when the search is too broad"""
|
|
||||||
|
|
|
@ -1,38 +1,71 @@
|
||||||
import re
|
from .core.scrapers.models import Property, ListingType
|
||||||
|
import pandas as pd
|
||||||
|
from .exceptions import InvalidListingType
|
||||||
|
|
||||||
|
ordered_properties = [
|
||||||
|
"property_url",
|
||||||
|
"mls",
|
||||||
|
"mls_id",
|
||||||
|
"status",
|
||||||
|
"style",
|
||||||
|
"street",
|
||||||
|
"unit",
|
||||||
|
"city",
|
||||||
|
"state",
|
||||||
|
"zip_code",
|
||||||
|
"beds",
|
||||||
|
"full_baths",
|
||||||
|
"half_baths",
|
||||||
|
"sqft",
|
||||||
|
"year_built",
|
||||||
|
"list_price",
|
||||||
|
"list_date",
|
||||||
|
"sold_price",
|
||||||
|
"last_sold_date",
|
||||||
|
"lot_sqft",
|
||||||
|
"price_per_sqft",
|
||||||
|
"latitude",
|
||||||
|
"longitude",
|
||||||
|
"stories",
|
||||||
|
"hoa_fee",
|
||||||
|
"parking_garage",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
def parse_address_one(street_address: str) -> tuple:
|
def process_result(result: Property) -> pd.DataFrame:
|
||||||
if not street_address:
|
prop_data = {prop: None for prop in ordered_properties}
|
||||||
return street_address, "#"
|
prop_data.update(result.__dict__)
|
||||||
|
|
||||||
apt_match = re.search(
|
if "address" in prop_data:
|
||||||
r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+|SUITE\s*[\dA-Z]+)$",
|
address_data = prop_data["address"]
|
||||||
street_address,
|
prop_data["street"] = address_data.street
|
||||||
re.I,
|
prop_data["unit"] = address_data.unit
|
||||||
)
|
prop_data["city"] = address_data.city
|
||||||
|
prop_data["state"] = address_data.state
|
||||||
|
prop_data["zip_code"] = address_data.zip
|
||||||
|
|
||||||
if apt_match:
|
prop_data["price_per_sqft"] = prop_data["prc_sqft"]
|
||||||
apt_str = apt_match.group().strip()
|
|
||||||
cleaned_apt_str = re.sub(r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", "#", apt_str, flags=re.I)
|
|
||||||
|
|
||||||
main_address = street_address.replace(apt_str, "").strip()
|
description = result.description
|
||||||
return main_address, cleaned_apt_str
|
prop_data["style"] = description.style
|
||||||
else:
|
prop_data["beds"] = description.beds
|
||||||
return street_address, "#"
|
prop_data["full_baths"] = description.baths_full
|
||||||
|
prop_data["half_baths"] = description.baths_half
|
||||||
|
prop_data["sqft"] = description.sqft
|
||||||
|
prop_data["lot_sqft"] = description.lot_sqft
|
||||||
|
prop_data["sold_price"] = description.sold_price
|
||||||
|
prop_data["year_built"] = description.year_built
|
||||||
|
prop_data["parking_garage"] = description.garage
|
||||||
|
prop_data["stories"] = description.stories
|
||||||
|
|
||||||
|
properties_df = pd.DataFrame([prop_data])
|
||||||
|
properties_df = properties_df.reindex(columns=ordered_properties)
|
||||||
|
|
||||||
|
return properties_df[ordered_properties]
|
||||||
|
|
||||||
|
|
||||||
def parse_address_two(street_address: str):
|
def validate_input(listing_type: str) -> None:
|
||||||
if not street_address:
|
if listing_type.upper() not in ListingType.__members__:
|
||||||
return "#"
|
raise InvalidListingType(
|
||||||
apt_match = re.search(
|
f"Provided listing type, '{listing_type}', does not exist."
|
||||||
r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+|SUITE\s*[\dA-Z]+)$",
|
)
|
||||||
street_address,
|
|
||||||
re.I,
|
|
||||||
)
|
|
||||||
|
|
||||||
if apt_match:
|
|
||||||
apt_str = apt_match.group().strip()
|
|
||||||
apt_str = re.sub(r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", "#", apt_str, flags=re.I)
|
|
||||||
return apt_str
|
|
||||||
else:
|
|
||||||
return "#"
|
|
||||||
|
|
|
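The new `process_result` starts from a dictionary with every expected column set to `None`, copies nested address and description fields into flat keys, and keeps the column order fixed. The sketch below shows the same idea with plain dictionaries instead of a pandas DataFrame. The `Example*` classes and the shortened column list are hypothetical and are not the project's models.

```python
from dataclasses import dataclass
from typing import Optional

# Shortened, hypothetical column list; the real one is `ordered_properties`.
COLUMNS = ["property_url", "street", "city", "state", "zip_code", "beds", "sqft"]


@dataclass
class ExampleAddress:
    street: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    zip: Optional[str] = None


@dataclass
class ExampleDescription:
    beds: Optional[int] = None
    sqft: Optional[int] = None


@dataclass
class ExampleListing:
    property_url: str
    address: Optional[ExampleAddress] = None
    description: Optional[ExampleDescription] = None


def flatten(listing: ExampleListing) -> dict:
    """Build one flat row whose keys follow COLUMNS, with None for missing data."""
    row = {column: None for column in COLUMNS}
    row["property_url"] = listing.property_url

    if listing.address is not None:
        row["street"] = listing.address.street
        row["city"] = listing.address.city
        row["state"] = listing.address.state
        row["zip_code"] = listing.address.zip

    if listing.description is not None:
        row["beds"] = listing.description.beds
        row["sqft"] = listing.description.sqft

    return row


row = flatten(
    ExampleListing(
        property_url="https://example.com/listing/1",
        address=ExampleAddress(street="2530 Al Lipscomb Way", city="Dallas", state="TX"),
    )
)
```

Pre-filling the row with `None` means every row has the same keys in the same order, which is what lets the original reindex the DataFrame by `ordered_properties` without missing-column errors.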
@ -1,6 +1,6 @@
|
||||||
[tool.poetry]
|
[tool.poetry]
|
||||||
name = "homeharvest"
|
name = "homeharvest"
|
||||||
version = "0.2.19"
|
version = "0.3.0"
|
||||||
description = "Real estate scraping library supporting Zillow, Realtor.com & Redfin."
|
description = "Real estate scraping library supporting Zillow, Realtor.com & Redfin."
|
||||||
authors = ["Zachary Hampton <zachary@zacharysproducts.com>", "Cullen Watson <cullen@cullen.ai>"]
|
authors = ["Zachary Hampton <zachary@zacharysproducts.com>", "Cullen Watson <cullen@cullen.ai>"]
|
||||||
homepage = "https://github.com/ZacharyHampton/HomeHarvest"
|
homepage = "https://github.com/ZacharyHampton/HomeHarvest"
|
||||||
|
|
|
@ -1,26 +1,78 @@
|
||||||
from homeharvest import scrape_property
|
from homeharvest import scrape_property
|
||||||
from homeharvest.exceptions import (
|
from homeharvest.exceptions import (
|
||||||
InvalidSite,
|
|
||||||
InvalidListingType,
|
InvalidListingType,
|
||||||
NoResultsFound,
|
NoResultsFound,
|
||||||
GeoCoordsNotFound,
|
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_realtor_pending_or_contingent():
|
||||||
|
pending_or_contingent_result = scrape_property(
|
||||||
|
location="Surprise, AZ",
|
||||||
|
pending_or_contingent=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
regular_result = scrape_property(
|
||||||
|
location="Surprise, AZ",
|
||||||
|
pending_or_contingent=False,
|
||||||
|
)
|
||||||
|
|
||||||
|
assert all([result is not None for result in [pending_or_contingent_result, regular_result]])
|
||||||
|
assert len(pending_or_contingent_result) != len(regular_result)
|
||||||
|
|
||||||
|
|
||||||
|
def test_realtor_comps():
|
||||||
|
result = scrape_property(
|
||||||
|
location="2530 Al Lipscomb Way",
|
||||||
|
radius=0.5,
|
||||||
|
property_younger_than=180,
|
||||||
|
listing_type="sold",
|
||||||
|
)
|
||||||
|
|
||||||
|
assert result is not None and len(result) > 0
|
||||||
|
|
||||||
|
|
||||||
|
def test_realtor_last_x_days_sold():
|
||||||
|
days_result_30 = scrape_property(
|
||||||
|
location="Dallas, TX", listing_type="sold", property_younger_than=30
|
||||||
|
)
|
||||||
|
|
||||||
|
days_result_10 = scrape_property(
|
||||||
|
location="Dallas, TX", listing_type="sold", property_younger_than=10
|
||||||
|
)
|
||||||
|
|
||||||
|
assert all(
|
||||||
|
[result is not None for result in [days_result_30, days_result_10]]
|
||||||
|
) and len(days_result_30) != len(days_result_10)
|
||||||
|
|
||||||
|
|
||||||
|
def test_realtor_single_property():
|
||||||
|
results = [
|
||||||
|
scrape_property(
|
||||||
|
location="15509 N 172nd Dr, Surprise, AZ 85388",
|
||||||
|
listing_type="for_sale",
|
||||||
|
),
|
||||||
|
scrape_property(
|
||||||
|
location="2530 Al Lipscomb Way",
|
||||||
|
listing_type="for_sale",
|
||||||
|
),
|
||||||
|
]
|
||||||
|
|
||||||
|
assert all([result is not None for result in results])
|
||||||
|
|
||||||
|
|
||||||
def test_realtor():
|
def test_realtor():
|
||||||
results = [
|
results = [
|
||||||
scrape_property(
|
scrape_property(
|
||||||
location="2530 Al Lipscomb Way",
|
location="2530 Al Lipscomb Way",
|
||||||
site_name="realtor.com",
|
|
||||||
listing_type="for_sale",
|
listing_type="for_sale",
|
||||||
),
|
),
|
||||||
scrape_property(
|
scrape_property(
|
||||||
location="Phoenix, AZ", site_name=["realtor.com"], listing_type="for_rent"
|
location="Phoenix, AZ", listing_type="for_rent"
|
||||||
), #: does not support "city, state, USA" format
|
), #: does not support "city, state, USA" format
|
||||||
scrape_property(
|
scrape_property(
|
||||||
location="Dallas, TX", site_name="realtor.com", listing_type="sold"
|
location="Dallas, TX", listing_type="sold"
|
||||||
), #: does not support "city, state, USA" format
|
), #: does not support "city, state, USA" format
|
||||||
scrape_property(location="85281", site_name="realtor.com"),
|
scrape_property(location="85281"),
|
||||||
]
|
]
|
||||||
|
|
||||||
assert all([result is not None for result in results])
|
assert all([result is not None for result in results])
|
||||||
|
@ -30,11 +82,10 @@ def test_realtor():
|
||||||
bad_results += [
|
bad_results += [
|
||||||
scrape_property(
|
scrape_property(
|
||||||
location="abceefg ju098ot498hh9",
|
location="abceefg ju098ot498hh9",
|
||||||
site_name="realtor.com",
|
|
||||||
listing_type="for_sale",
|
listing_type="for_sale",
|
||||||
)
|
)
|
||||||
]
|
]
|
||||||
except (InvalidSite, InvalidListingType, NoResultsFound, GeoCoordsNotFound):
|
except (InvalidListingType, NoResultsFound):
|
||||||
assert True
|
assert True
|
||||||
|
|
||||||
assert all([result is None for result in bad_results])
|
assert all([result is None for result in bad_results])
|
||||||
|
|
|
@ -1,35 +0,0 @@
|
||||||
from homeharvest import scrape_property
|
|
||||||
from homeharvest.exceptions import (
|
|
||||||
InvalidSite,
|
|
||||||
InvalidListingType,
|
|
||||||
NoResultsFound,
|
|
||||||
GeoCoordsNotFound,
|
|
||||||
SearchTooBroad,
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
def test_redfin():
|
|
||||||
results = [
|
|
||||||
scrape_property(location="San Diego", site_name="redfin", listing_type="for_sale"),
|
|
||||||
scrape_property(location="2530 Al Lipscomb Way", site_name="redfin", listing_type="for_sale"),
|
|
||||||
scrape_property(location="Phoenix, AZ, USA", site_name=["redfin"], listing_type="for_rent"),
|
|
||||||
scrape_property(location="Dallas, TX, USA", site_name="redfin", listing_type="sold"),
|
|
||||||
scrape_property(location="85281", site_name="redfin"),
|
|
||||||
]
|
|
||||||
|
|
||||||
assert all([result is not None for result in results])
|
|
||||||
|
|
||||||
bad_results = []
|
|
||||||
try:
|
|
||||||
bad_results += [
|
|
||||||
scrape_property(
|
|
||||||
location="abceefg ju098ot498hh9",
|
|
||||||
site_name="redfin",
|
|
||||||
listing_type="for_sale",
|
|
||||||
),
|
|
||||||
scrape_property(location="Florida", site_name="redfin", listing_type="for_rent"),
|
|
||||||
]
|
|
||||||
except (InvalidSite, InvalidListingType, NoResultsFound, GeoCoordsNotFound, SearchTooBroad):
|
|
||||||
assert True
|
|
||||||
|
|
||||||
assert all([result is None for result in bad_results])
|
|
|
@ -1,24 +0,0 @@
|
||||||
from homeharvest.utils import parse_address_one, parse_address_two
|
|
||||||
|
|
||||||
|
|
||||||
def test_parse_address_one():
|
|
||||||
test_data = [
|
|
||||||
("4303 E Cactus Rd Apt 126", ("4303 E Cactus Rd", "#126")),
|
|
||||||
("1234 Elm Street apt 2B", ("1234 Elm Street", "#2B")),
|
|
||||||
("1234 Elm Street UNIT 3A", ("1234 Elm Street", "#3A")),
|
|
||||||
("1234 Elm Street unit 3A", ("1234 Elm Street", "#3A")),
|
|
||||||
("1234 Elm Street SuIte 3A", ("1234 Elm Street", "#3A")),
|
|
||||||
]
|
|
||||||
|
|
||||||
for input_data, (exp_addr_one, exp_addr_two) in test_data:
|
|
||||||
address_one, address_two = parse_address_one(input_data)
|
|
||||||
assert address_one == exp_addr_one
|
|
||||||
assert address_two == exp_addr_two
|
|
||||||
|
|
||||||
|
|
||||||
def test_parse_address_two():
|
|
||||||
test_data = [("Apt 126", "#126"), ("apt 2B", "#2B"), ("UNIT 3A", "#3A"), ("unit 3A", "#3A"), ("SuIte 3A", "#3A")]
|
|
||||||
|
|
||||||
for input_data, expected in test_data:
|
|
||||||
output = parse_address_two(input_data)
|
|
||||||
assert output == expected
|
|
|
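The deleted tests above exercise `parse_address_one`, which peels a trailing unit designator (APT, UNIT, LOT, SUITE or `#`) off a street address and normalizes it to the `#...` form. A self-contained sketch of that behavior is given below, using the same regular expressions as the removed helper. `split_unit` is a hypothetical name, and the street is obtained by slicing at the match start instead of the original `str.replace`.

```python
import re

# Trailing unit designator, matched case-insensitively at the end of the string.
UNIT_PATTERN = re.compile(
    r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+|SUITE\s*[\dA-Z]+)$",
    re.I,
)
# Keyword part of the designator, replaced by '#'.
UNIT_KEYWORD = re.compile(r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", re.I)


def split_unit(street_address):
    """Return (street, unit); unit is '#' when no designator is found."""
    if not street_address:
        return street_address, "#"

    match = UNIT_PATTERN.search(street_address)
    if match is None:
        return street_address, "#"

    unit = UNIT_KEYWORD.sub("#", match.group())
    street = street_address[: match.start()].strip()
    return street, unit


examples = [
    split_unit("4303 E Cactus Rd Apt 126"),
    split_unit("1234 Elm Street SuIte 3A"),
    split_unit("12 Main St"),
]
```

The first two inputs are taken from the deleted test data; the third is an added case with no unit, which falls back to the `#` placeholder.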
@ -1,34 +0,0 @@
|
||||||
from homeharvest import scrape_property
|
|
||||||
from homeharvest.exceptions import (
|
|
||||||
InvalidSite,
|
|
||||||
InvalidListingType,
|
|
||||||
NoResultsFound,
|
|
||||||
GeoCoordsNotFound,
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
def test_zillow():
|
|
||||||
results = [
|
|
||||||
scrape_property(location="2530 Al Lipscomb Way", site_name="zillow", listing_type="for_sale"),
|
|
||||||
scrape_property(location="Phoenix, AZ, USA", site_name=["zillow"], listing_type="for_rent"),
|
|
||||||
scrape_property(location="Surprise, AZ", site_name=["zillow"], listing_type="for_sale"),
|
|
||||||
scrape_property(location="Dallas, TX, USA", site_name="zillow", listing_type="sold"),
|
|
||||||
scrape_property(location="85281", site_name="zillow"),
|
|
||||||
scrape_property(location="3268 88th st s, Lakewood", site_name="zillow", listing_type="for_rent"),
|
|
||||||
]
|
|
||||||
|
|
||||||
assert all([result is not None for result in results])
|
|
||||||
|
|
||||||
bad_results = []
|
|
||||||
try:
|
|
||||||
bad_results += [
|
|
||||||
scrape_property(
|
|
||||||
location="abceefg ju098ot498hh9",
|
|
||||||
site_name="zillow",
|
|
||||||
listing_type="for_sale",
|
|
||||||
)
|
|
||||||
]
|
|
||||||
except (InvalidSite, InvalidListingType, NoResultsFound, GeoCoordsNotFound):
|
|
||||||
assert True
|
|
||||||
|
|
||||||
assert all([result is None for result in bad_results])
|
|