Compare commits


40 Commits

Author SHA1 Message Date
Zachary Hampton
05713c76b0 - redfin bug fix
- .get
2023-09-21 11:27:12 -07:00
Cullen Watson
9120cc9bfe fix: remove line 2023-09-21 13:10:14 -05:00
Cullen Watson
eee4b19515 Merge branch 'master' of https://github.com/ZacharyHampton/HomeHarvest 2023-09-21 13:06:15 -05:00
Cullen Watson
c25961eded fix: KeyEror : [minBaths] 2023-09-21 13:06:06 -05:00
Zachary Hampton
0884c3d163 Update README.md 2023-09-21 09:55:29 -07:00
Cullen Watson
8f37bfdeb8 chore: version number 2023-09-21 11:19:23 -05:00
Cullen Watson
48c2338276 fix: keyerror 2023-09-21 11:18:37 -05:00
Cullen Watson
f58a1f4a74 docs: tryhomeharvest.com 2023-09-21 10:57:11 -05:00
Zachary Hampton
4cef926d7d Merge pull request #14 from ZacharyHampton/keep_duplicates_flag
Keep duplicates flag
2023-09-20 20:27:08 -07:00
Cullen Watson
e82eeaa59f docs: add keep duplicates flag 2023-09-20 20:25:50 -05:00
Cullen Watson
644f16b25b feat: keep duplicates flag 2023-09-20 20:24:18 -05:00
Cullen Watson
e9ddc6df92 docs: update tutorial vid for release v0.2.7 2023-09-19 22:18:49 -05:00
Cullen Watson
50fb1c391d docs: update property schema 2023-09-19 21:35:37 -05:00
Cullen Watson
4f91f9dadb chore: version number 2023-09-19 21:17:12 -05:00
Zachary Hampton
66e55173b1 Merge pull request #13 from ZacharyHampton/simplify_fields
fix: simplify fields
2023-09-19 19:16:18 -07:00
Cullen Watson
f6054e8746 fix: simplify fields 2023-09-19 21:13:20 -05:00
Cullen Watson
e8d9235ee6 chore: update version number 2023-09-19 16:43:59 -05:00
Cullen Watson
043f091158 fix: keyerror on address 2023-09-19 16:43:17 -05:00
Cullen Watson
eae8108978 docs: change cmd 2023-09-19 16:18:01 -05:00
Zachary Hampton
0a39357a07 Merge pull request #12 from ZacharyHampton/proxy_bug
fix: proxy add to session correctly
2023-09-19 14:07:25 -07:00
Cullen Watson
8f06d46ddb chore: version number 2023-09-19 16:07:06 -05:00
Cullen Watson
0dae14ccfc fix: proxy add to session correctly 2023-09-19 16:05:14 -05:00
Zachary Hampton
9aaabdd5d8 Merge pull request #11 from ZacharyHampton/proxy_support
Proxy support
2023-09-19 13:50:14 -07:00
Cullen Watson
cdf41fe9f2 fix: remove self.proxy 2023-09-19 15:49:50 -05:00
Cullen Watson
1f0feb836d refactor: move proxy to session 2023-09-19 15:48:46 -05:00
Cullen Watson
5f31beda46 chore: version number 2023-09-19 15:44:41 -05:00
Cullen Watson
fd9cdea499 feat: proxy support 2023-09-19 15:43:24 -05:00
Zachary Hampton
93a1cbe17f Merge pull request #10 from ZacharyHampton/cli_homeharvest
add cli
2023-09-19 13:07:27 -07:00
Cullen Watson
49d27943c4 add cli 2023-09-19 15:01:39 -05:00
Zachary Hampton
05fca9b7e6 Update README.md 2023-09-19 11:08:08 -07:00
Zachary Hampton
20ce44fb3a - redfin limiting bug fix 2023-09-19 10:37:10 -07:00
Zachary Hampton
52017c1bb5 Merge pull request #9 from ZacharyHampton/redfin_rental_support
feat(redfin): rental support
2023-09-19 10:28:02 -07:00
Cullen Watson
dba1c03081 feat(redfin): add sold listing_type 2023-09-19 12:27:13 -05:00
Cullen Watson
1fc2d8c549 feat(redfin): rental support 2023-09-19 11:58:20 -05:00
Zachary Hampton
02d112eea0 Merge pull request #8 from ZacharyHampton/fix/zillow-location-validation
- zillow location validation
2023-09-19 09:33:33 -07:00
Zachary Hampton
30e510882b - version bump and excel support 2023-09-19 09:26:52 -07:00
Zachary Hampton
78b56c2cac - zillow location validation 2023-09-19 09:25:08 -07:00
Cullen Watson
087854a688 Merge branch 'master' of https://github.com/ZacharyHampton/HomeHarvest 2023-09-19 00:04:03 -05:00
Cullen Watson
80586467a8 docs:add guide 2023-09-18 23:53:10 -05:00
Cullen Watson
3494b152b8 docs: change install cmd 2023-09-18 23:32:51 -05:00
16 changed files with 532 additions and 359 deletions

.gitignore

@@ -3,4 +3,5 @@
 **/__pycache__/
 **/.pytest_cache/
 *.pyc
 /.ipynb_checkpoints/
+*.csv


@@ -55,7 +55,7 @@
 " location=\"2530 Al Lipscomb Way\",\n",
 " site_name=\"zillow\",\n",
 " listing_type=\"for_sale\"\n",
-"),"
+")"
 ]
 },
 {


@@ -4,22 +4,48 @@
 [![Try with Replit](https://replit.com/badge?caption=Try%20with%20Replit)](https://replit.com/@ZacharyHampton/HomeHarvestDemo)
+\
+**Not technical?** Try out the web scraping tool on our site at [tryhomeharvest.com](https://tryhomeharvest.com).
 
 *Looking to build a data-focused software product?* **[Book a call](https://calendly.com/zachary-products/15min)** *to work with us.*
 
+Check out another project we wrote: ***[JobSpy](https://github.com/cullenwatson/JobSpy)** a Python package for job scraping*
+
 ## Features
 
 - Scrapes properties from **Zillow**, **Realtor.com** & **Redfin** simultaneously
 - Aggregates the properties in a Pandas DataFrame
 
+[Video Guide for HomeHarvest](https://youtu.be/JnV7eR2Ve2o) - _updated for release v0.2.7_
+
 ![homeharvest](https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/b3d5d727-e67b-4a9f-85d8-1e65fd18620a)
 
 ## Installation
 
 ```bash
-pip install --upgrade homeharvest
+pip install homeharvest
 ```
 _Python version >= [3.10](https://www.python.org/downloads/release/python-3100/) required_
 
 ## Usage
 
+### CLI
+
+```bash
+homeharvest "San Francisco, CA" -s zillow realtor.com redfin -l for_rent -o excel -f HomeHarvest
+```
+
+This will scrape properties from the specified sites for the given location and listing type, and save the results to an Excel file named `HomeHarvest.xlsx`.
+
+By default:
+- If `-s` or `--site_name` is not provided, it will scrape from all available sites.
+- If `-l` or `--listing_type` is left blank, the default is `for_sale`. Other options are `for_rent` or `sold`.
+- The `-o` or `--output` default format is `excel`. Options are `csv` or `excel`.
+- If `-f` or `--filename` is left blank, the default is `HomeHarvest_<current_timestamp>`.
+- If `-p` or `--proxy` is not provided, the scraper uses the local IP.
+- Use `-k` or `--keep_duplicates` to keep duplicate properties based on address. If not provided, duplicates will be removed.
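For reference, here is a minimal standalone sketch (not taken from the repository) of how those CLI defaults resolve; it mirrors the parser added in `homeharvest/cli.py` later in this diff:

```python
import argparse

# Sketch of the HomeHarvest CLI parser: when only the location is given,
# every other option falls back to the documented default.
parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
parser.add_argument("location", type=str)
parser.add_argument("-s", "--site_name", nargs="*", default=None)
parser.add_argument("-l", "--listing_type", default="for_sale",
                    choices=["for_sale", "for_rent", "sold"])
parser.add_argument("-o", "--output", default="excel", choices=["excel", "csv"])
parser.add_argument("-f", "--filename", default=None)
parser.add_argument("-k", "--keep_duplicates", action="store_true")
parser.add_argument("-p", "--proxy", default=None)

# Only the positional location argument supplied:
args = parser.parse_args(["San Francisco, CA"])
print(args.listing_type, args.output, args.site_name, args.keep_duplicates)
# prints: for_sale excel None False
```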
+### Python
+
 ```py
 from homeharvest import scrape_property
 import pandas as pd
@@ -33,16 +59,17 @@ properties: pd.DataFrame = scrape_property(
 #: Note, to export to CSV or Excel, use properties.to_csv() or properties.to_excel().
 print(properties)
 ```
 
 ## Output
 
 ```py
 >>> properties.head()
-                                 street   city ... mls_id description
-0                   420 N Scottsdale Rd  Tempe ...    NaN         NaN
-1                  1255 E University Dr  Tempe ...    NaN         NaN
-2                1979 E Rio Salado Pkwy  Tempe ...    NaN         NaN
-3                       548 S Wilson St  Tempe ...   None        None
-4    945 E Playa Del Norte Dr Unit 4027  Tempe ...    NaN         NaN
-[5 rows x 23 columns]
+                                        property_url site_name listing_type  apt_min_price  apt_max_price ...
+0  https://www.redfin.com/AZ/Tempe/1003-W-Washing...    redfin     for_rent         1666.0         2750.0 ...
+1  https://www.redfin.com/AZ/Tempe/VELA-at-Town-L...    redfin     for_rent         1665.0         3763.0 ...
+2  https://www.redfin.com/AZ/Tempe/Camden-Tempe/a...    redfin     for_rent         1939.0         3109.0 ...
+3  https://www.redfin.com/AZ/Tempe/Emerson-Park/a...    redfin     for_rent         1185.0         1817.0 ...
+4  https://www.redfin.com/AZ/Tempe/Rio-Paradiso-A...    redfin     for_rent         1470.0         2235.0 ...
+[5 rows x 41 columns]
 ```
 
 ### Parameters for `scrape_properties()`
 
@@ -51,7 +78,9 @@ Required
 ├── location (str): address in various formats e.g. just zip, full address, city/state, etc.
 └── listing_type (enum): for_rent, for_sale, sold
 
 Optional
-├── site_name (List[enum], default=all three sites): zillow, realtor.com, redfin
+├── site_name (list[enum], default=all three sites): zillow, realtor.com, redfin
+├── proxy (str): in format 'http://user:pass@host:port' or [https, socks]
+└── keep_duplicates (bool, default=False): whether to keep or remove duplicate properties based on address
 ```
 ### Property Schema
 
@@ -60,7 +89,7 @@ Property
 ├── Basic Information:
 │ ├── property_url (str)
 │ ├── site_name (enum): zillow, redfin, realtor.com
-│ ├── listing_type (enum: ListingType)
+│ ├── listing_type (enum): for_sale, for_rent, sold
 │ └── property_type (enum): house, apartment, condo, townhouse, single_family, multi_family, building
 
 ├── Address Details:
@@ -71,38 +100,38 @@ Property
 │ ├── unit (str)
 │ └── country (str)
 
-├── Property Features:
-│ ├── price (int)
+├── House for Sale Features:
 │ ├── tax_assessed_value (int)
-│ ├── currency (str)
-│ ├── square_feet (int)
-│ ├── beds (int)
-│ ├── baths (float)
 │ ├── lot_area_value (float)
 │ ├── lot_area_unit (str)
 │ ├── stories (int)
 │ ├── year_built (int)
+│ └── price_per_sqft (int)
+
+├── Building for Sale and Apartment Details:
+│ ├── bldg_name (str)
+│ ├── beds_min (int)
+│ ├── beds_max (int)
+│ ├── baths_min (float)
+│ ├── baths_max (float)
+│ ├── sqft_min (int)
+│ ├── sqft_max (int)
+│ ├── price_min (int)
+│ ├── price_max (int)
+│ ├── area_min (int)
+│ └── unit_count (int)
 
 ├── Miscellaneous Details:
-│ ├── price_per_sqft (int)
 │ ├── mls_id (str)
 │ ├── agent_name (str)
 │ ├── img_src (str)
 │ ├── description (str)
 │ ├── status_text (str)
-│ ├── latitude (float)
-│ ├── longitude (float)
-│ └── posted_time (str) [Only for Zillow]
+│ └── posted_time (str)
 
-├── Building Details (for property_type: building):
-│ ├── bldg_name (str)
-│ ├── bldg_unit_count (int)
-│ ├── bldg_min_beds (int)
-│ ├── bldg_min_baths (float)
-│ └── bldg_min_area (int)
-
-└── Apartment Details (for property type: apartment):
-    └── apt_min_price (int)
+└── Location Details:
+    ├── latitude (float)
+    └── longitude (float)
 ```
 ## Supported Countries for Property Scraping
 
@@ -116,7 +145,7 @@ The following exceptions may be raised when using HomeHarvest:
 - `InvalidSite` - valid options: `zillow`, `redfin`, `realtor.com`
 - `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold`
 - `NoResultsFound` - no properties found from your input
-- `GeoCoordsNotFound` - if Zillow scraper is not able to create geo-coordinates from the location you input
+- `GeoCoordsNotFound` - if Zillow scraper is not able to derive geo-coordinates from the location you input
 
 ## Frequently Asked Questions


@@ -18,57 +18,54 @@ _scrapers = {
 }
 
 
-def validate_input(site_name: str, listing_type: str) -> None:
+def _validate_input(site_name: str, listing_type: str) -> None:
     if site_name.lower() not in _scrapers:
         raise InvalidSite(f"Provided site, '{site_name}', does not exist.")
 
     if listing_type.upper() not in ListingType.__members__:
-        raise InvalidListingType(
-            f"Provided listing type, '{listing_type}', does not exist."
-        )
+        raise InvalidListingType(f"Provided listing type, '{listing_type}', does not exist.")
 
 
-def get_ordered_properties(result: Property) -> list[str]:
+def _get_ordered_properties(result: Property) -> list[str]:
     return [
         "property_url",
         "site_name",
         "listing_type",
         "property_type",
         "status_text",
-        "currency",
-        "price",
-        "apt_min_price",
+        "baths_min",
+        "baths_max",
+        "beds_min",
+        "beds_max",
+        "sqft_min",
+        "sqft_max",
+        "price_min",
+        "price_max",
+        "unit_count",
         "tax_assessed_value",
-        "square_feet",
         "price_per_sqft",
-        "beds",
-        "baths",
         "lot_area_value",
         "lot_area_unit",
-        "street_address",
-        "unit",
+        "address_one",
+        "address_two",
         "city",
         "state",
         "zip_code",
-        "country",
         "posted_time",
-        "bldg_min_beds",
-        "bldg_min_baths",
-        "bldg_min_area",
-        "bldg_unit_count",
+        "area_min",
         "bldg_name",
         "stories",
         "year_built",
         "agent_name",
         "mls_id",
-        "description",
         "img_src",
         "latitude",
         "longitude",
+        "description",
     ]
 
 
-def process_result(result: Property) -> pd.DataFrame:
+def _process_result(result: Property) -> pd.DataFrame:
     prop_data = result.__dict__
 
     prop_data["site_name"] = prop_data["site_name"].value
@@ -79,42 +76,38 @@ def process_result(result: Property) -> pd.DataFrame:
         prop_data["property_type"] = None
     if "address" in prop_data:
         address_data = prop_data["address"]
-        prop_data["street_address"] = address_data.street_address
-        prop_data["unit"] = address_data.unit
+        prop_data["address_one"] = address_data.address_one
+        prop_data["address_two"] = address_data.address_two
         prop_data["city"] = address_data.city
         prop_data["state"] = address_data.state
         prop_data["zip_code"] = address_data.zip_code
-        prop_data["country"] = address_data.country
 
         del prop_data["address"]
 
     properties_df = pd.DataFrame([prop_data])
-    properties_df = properties_df[get_ordered_properties(result)]
+    properties_df = properties_df[_get_ordered_properties(result)]
 
     return properties_df
 
 
-def _scrape_single_site(
-    location: str, site_name: str, listing_type: str
-) -> pd.DataFrame:
+def _scrape_single_site(location: str, site_name: str, listing_type: str, proxy: str = None) -> pd.DataFrame:
     """
     Helper function to scrape a single site.
     """
-    validate_input(site_name, listing_type)
+    _validate_input(site_name, listing_type)
 
     scraper_input = ScraperInput(
         location=location,
         listing_type=ListingType[listing_type.upper()],
         site_name=SiteName.get_by_value(site_name.lower()),
+        proxy=proxy,
     )
 
     site = _scrapers[site_name.lower()](scraper_input)
     results = site.search()
 
-    properties_dfs = [process_result(result) for result in results]
-    properties_dfs = [
-        df.dropna(axis=1, how="all") for df in properties_dfs if not df.empty
-    ]
+    properties_dfs = [_process_result(result) for result in results]
+    properties_dfs = [df.dropna(axis=1, how="all") for df in properties_dfs if not df.empty]
 
     if not properties_dfs:
         return pd.DataFrame()
 
@@ -125,6 +118,8 @@ def scrape_property(
     location: str,
     site_name: Union[str, list[str]] = None,
     listing_type: str = "for_sale",
+    proxy: str = None,
+    keep_duplicates: bool = False
 ) -> pd.DataFrame:
     """
     Scrape property from various sites from a given location and listing type.
@@ -144,14 +139,12 @@ def scrape_property(
     results = []
 
     if len(site_name) == 1:
-        final_df = _scrape_single_site(location, site_name[0], listing_type)
+        final_df = _scrape_single_site(location, site_name[0], listing_type, proxy)
         results.append(final_df)
     else:
         with ThreadPoolExecutor() as executor:
             futures = {
-                executor.submit(
-                    _scrape_single_site, location, s_name, listing_type
-                ): s_name
+                executor.submit(_scrape_single_site, location, s_name, listing_type, proxy): s_name
                 for s_name in site_name
             }
 
@@ -166,14 +159,13 @@ def scrape_property(
     final_df = pd.concat(results, ignore_index=True)
 
-    columns_to_track = ["street_address", "city", "unit"]
+    columns_to_track = ["address_one", "address_two", "city"]
 
     #: validate they exist, otherwise create them
     for col in columns_to_track:
         if col not in final_df.columns:
             final_df[col] = None
 
-    final_df = final_df.drop_duplicates(
-        subset=["street_address", "city", "unit"], keep="first"
-    )
+    if not keep_duplicates:
+        final_df = final_df.drop_duplicates(subset=columns_to_track, keep="first")
 
     return final_df
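The new opt-out deduplication in `scrape_property` can be illustrated with a small standalone pandas sketch; the column names come from the diff, while the DataFrame contents are made up:

```python
import pandas as pd

# Toy frame: rows 0 and 2 share the same address and should collapse to one.
df = pd.DataFrame(
    {
        "address_one": ["420 N Scottsdale Rd", "548 S Wilson St", "420 N Scottsdale Rd"],
        "address_two": ["#", "#", "#"],
        "city": ["Tempe", "Tempe", "Tempe"],
        "price_min": [1666, 1185, 1700],
    }
)

keep_duplicates = False
columns_to_track = ["address_one", "address_two", "city"]

# Mirror scrape_property: make sure the tracked columns exist, then drop
# duplicate addresses unless the caller asked to keep them.
for col in columns_to_track:
    if col not in df.columns:
        df[col] = None

if not keep_duplicates:
    df = df.drop_duplicates(subset=columns_to_track, keep="first")

print(len(df))  # 2 — the later duplicate of row 0 is dropped
```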

homeharvest/cli.py (new file)

@@ -0,0 +1,73 @@
import argparse
import datetime
from homeharvest import scrape_property


def main():
    parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
    parser.add_argument("location", type=str, help="Location to scrape (e.g., San Francisco, CA)")
    parser.add_argument(
        "-s",
        "--site_name",
        type=str,
        nargs="*",
        default=None,
        help="Site name(s) to scrape from (e.g., realtor, zillow)",
    )
    parser.add_argument(
        "-l",
        "--listing_type",
        type=str,
        default="for_sale",
        choices=["for_sale", "for_rent", "sold"],
        help="Listing type to scrape",
    )
    parser.add_argument(
        "-o",
        "--output",
        type=str,
        default="excel",
        choices=["excel", "csv"],
        help="Output format",
    )
    parser.add_argument(
        "-f",
        "--filename",
        type=str,
        default=None,
        help="Name of the output file (without extension)",
    )
    parser.add_argument(
        "-k",
        "--keep_duplicates",
        action="store_true",
        help="Keep duplicate properties based on address"
    )
    parser.add_argument("-p", "--proxy", type=str, default=None, help="Proxy to use for scraping")

    args = parser.parse_args()

    result = scrape_property(args.location, args.site_name, args.listing_type, proxy=args.proxy, keep_duplicates=args.keep_duplicates)

    if not args.filename:
        timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
        args.filename = f"HomeHarvest_{timestamp}"

    if args.output == "excel":
        output_filename = f"{args.filename}.xlsx"
        result.to_excel(output_filename, index=False)
        print(f"Excel file saved as {output_filename}")
    elif args.output == "csv":
        output_filename = f"{args.filename}.csv"
        result.to_csv(output_filename, index=False)
        print(f"CSV file saved as {output_filename}")


if __name__ == "__main__":
    main()


@@ -8,7 +8,7 @@ class ScraperInput:
     location: str
     listing_type: ListingType
     site_name: SiteName
-    proxy_url: str | None = None
+    proxy: str | None = None
 
 
 class Scraper:
@@ -17,15 +17,13 @@ class Scraper:
         self.listing_type = scraper_input.listing_type
 
         self.session = requests.Session()
+        if scraper_input.proxy:
+            proxy_url = scraper_input.proxy
+            proxies = {"http": proxy_url, "https": proxy_url}
+            self.session.proxies.update(proxies)
+
         self.listing_type = scraper_input.listing_type
         self.site_name = scraper_input.site_name
 
-        if scraper_input.proxy_url:
-            self.session.proxies = {
-                "http": scraper_input.proxy_url,
-                "https": scraper_input.proxy_url,
-            }
-
     def search(self) -> list[Property]:
         ...
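The new proxy wiring can be exercised in isolation with `requests` and no network calls; the proxy URL below is a placeholder in the README's `http://user:pass@host:port` format:

```python
import requests

# Placeholder proxy URL; format follows the README's 'http://user:pass@host:port'.
proxy_url = "http://user:pass@127.0.0.1:8080"

session = requests.Session()
proxies = {"http": proxy_url, "https": proxy_url}
# .update() merges into whatever proxies the session already carries,
# rather than replacing the whole dict as the removed assignment did.
session.proxies.update(proxies)

print(session.proxies["https"])
```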


@@ -1,5 +1,6 @@
 from dataclasses import dataclass
 from enum import Enum
+from typing import Tuple
 
 
 class SiteName(Enum):
@@ -56,12 +57,11 @@ class PropertyType(Enum):
 @dataclass
 class Address:
-    street_address: str
-    city: str
-    state: str
-    zip_code: str
-    unit: str | None = None
-    country: str | None = None
+    address_one: str | None = None
+    address_two: str | None = "#"
+    city: str | None = None
+    state: str | None = None
+    zip_code: str | None = None
 
 
 @dataclass
@@ -73,12 +73,7 @@ class Property:
     property_type: PropertyType | None = None
 
     # house for sale
-    price: int | None = None
     tax_assessed_value: int | None = None
-    currency: str | None = None
-    square_feet: int | None = None
-    beds: int | None = None
-    baths: float | None = None
     lot_area_value: float | None = None
     lot_area_unit: str | None = None
     stories: int | None = None
@@ -90,16 +85,25 @@ class Property:
     img_src: str | None = None
     description: str | None = None
     status_text: str | None = None
-    latitude: float | None = None
-    longitude: float | None = None
     posted_time: str | None = None
 
     # building for sale
     bldg_name: str | None = None
-    bldg_unit_count: int | None = None
-    bldg_min_beds: int | None = None
-    bldg_min_baths: float | None = None
-    bldg_min_area: int | None = None
-
-    # apt
-    apt_min_price: int | None = None
+    area_min: int | None = None
+    beds_min: int | None = None
+    beds_max: int | None = None
+    baths_min: float | None = None
+    baths_max: float | None = None
+    sqft_min: int | None = None
+    sqft_max: int | None = None
+    price_min: int | None = None
+    price_max: int | None = None
+    unit_count: int | None = None
+
+    latitude: float | None = None
+    longitude: float | None = None


@@ -1,16 +1,23 @@
-import json
+"""
+homeharvest.realtor.__init__
+~~~~~~~~~~~~
+This module implements the scraper for relator.com
+"""
 from ..models import Property, Address
 from .. import Scraper
+from typing import Any, Generator
 from ....exceptions import NoResultsFound
-from ....utils import parse_address_two, parse_unit
+from ....utils import parse_address_one, parse_address_two
 from concurrent.futures import ThreadPoolExecutor, as_completed
 
 
 class RealtorScraper(Scraper):
     def __init__(self, scraper_input):
+        self.counter = 1
         super().__init__(scraper_input)
-        self.search_url = "https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta"
+        self.search_url = (
+            "https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta"
+        )
 
     def handle_location(self):
         headers = {
@@ -50,6 +57,9 @@ class RealtorScraper(Scraper):
         return result[0]
 
     def handle_address(self, property_id: str) -> list[Property]:
+        """
+        Handles a specific address & returns one property
+        """
         query = """query Property($property_id: ID!) {
                     property(id: $property_id) {
                         property_id
@@ -108,43 +118,45 @@ class RealtorScraper(Scraper):
         response_json = response.json()
 
         property_info = response_json["data"]["property"]
 
-        street_address, unit = parse_address_two(property_info["address"]["line"])
+        address_one, address_two = parse_address_one(property_info["address"]["line"])
 
         return [
             Property(
                 site_name=self.site_name,
                 address=Address(
-                    street_address=street_address,
+                    address_one=address_one,
+                    address_two=address_two,
                     city=property_info["address"]["city"],
                     state=property_info["address"]["state_code"],
                     zip_code=property_info["address"]["postal_code"],
-                    unit=unit,
-                    country="USA",
                 ),
                 property_url="https://www.realtor.com/realestateandhomes-detail/"
                 + property_info["details"]["permalink"],
-                beds=property_info["basic"]["beds"],
-                baths=property_info["basic"]["baths"],
                 stories=property_info["details"]["stories"],
                 year_built=property_info["details"]["year_built"],
-                square_feet=property_info["basic"]["sqft"],
-                price_per_sqft=property_info["basic"]["price"]
-                // property_info["basic"]["sqft"]
-                if property_info["basic"]["sqft"] is not None
-                and property_info["basic"]["price"] is not None
+                price_per_sqft=property_info["basic"]["price"] // property_info["basic"]["sqft"]
+                if property_info["basic"]["sqft"] is not None and property_info["basic"]["price"] is not None
                 else None,
-                price=property_info["basic"]["price"],
                 mls_id=property_id,
                 listing_type=self.listing_type,
                 lot_area_value=property_info["public_record"]["lot_size"]
                 if property_info["public_record"] is not None
                 else None,
+                beds_min=property_info["basic"]["beds"],
+                beds_max=property_info["basic"]["beds"],
+                baths_min=property_info["basic"]["baths"],
+                baths_max=property_info["basic"]["baths"],
+                sqft_min=property_info["basic"]["sqft"],
+                sqft_max=property_info["basic"]["sqft"],
+                price_min=property_info["basic"]["price"],
+                price_max=property_info["basic"]["price"],
            )
        ]
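The guarded integer division that `handle_address` uses for `price_per_sqft` can be checked in isolation; the helper name and sample values here are illustrative, not from the repository:

```python
# Hypothetical helper reproducing the guard from handle_address: only divide
# when both price and square footage are present.
def price_per_sqft(price, sqft):
    if price is not None and sqft is not None:
        return price // sqft
    return None

print(price_per_sqft(450_000, 1_800))  # 250
print(price_per_sqft(450_000, None))   # None
```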
-    def handle_area(
-        self, variables: dict, return_total: bool = False
-    ) -> list[Property] | int:
+    def handle_area(self, variables: dict, return_total: bool = False) -> list[Property] | int:
+        """
+        Handles a location area & returns a list of properties
+        """
         query = (
             """query Home_search(
                         $city: String,
@@ -237,17 +249,15 @@ class RealtorScraper(Scraper):
             return []
 
         for result in response_json["data"]["home_search"]["results"]:
-            street_address, unit = parse_address_two(
-                result["location"]["address"]["line"]
-            )
+            self.counter += 1
+            address_one, _ = parse_address_one(result["location"]["address"]["line"])
             realty_property = Property(
                 address=Address(
-                    street_address=street_address,
+                    address_one=address_one,
                     city=result["location"]["address"]["city"],
                     state=result["location"]["address"]["state_code"],
                     zip_code=result["location"]["address"]["postal_code"],
-                    unit=parse_unit(result["location"]["address"]["unit"]),
-                    country="USA",
+                    address_two=parse_address_two(result["location"]["address"]["unit"]),
                 ),
                 latitude=result["location"]["address"]["coordinate"]["lat"]
                 if result
@@ -264,20 +274,22 @@ class RealtorScraper(Scraper):
                 and "lon" in result["location"]["address"]["coordinate"]
                 else None,
                 site_name=self.site_name,
-                property_url="https://www.realtor.com/realestateandhomes-detail/"
-                + result["property_id"],
-                beds=result["description"]["beds"],
-                baths=result["description"]["baths"],
+                property_url="https://www.realtor.com/realestateandhomes-detail/" + result["property_id"],
                 stories=result["description"]["stories"],
                 year_built=result["description"]["year_built"],
-                square_feet=result["description"]["sqft"],
                 price_per_sqft=result["price_per_sqft"],
-                price=result["list_price"],
                 mls_id=result["property_id"],
                 listing_type=self.listing_type,
                 lot_area_value=result["description"]["lot_sqft"],
+                beds_min=result["description"]["beds"],
+                beds_max=result["description"]["beds"],
+                baths_min=result["description"]["baths"],
+                baths_max=result["description"]["baths"],
+                sqft_min=result["description"]["sqft"],
+                sqft_max=result["description"]["sqft"],
+                price_min=result["list_price"],
+                price_max=result["list_price"],
             )
             properties.append(realty_property)
 
         return properties


@@ -1,8 +1,14 @@
+"""
+homeharvest.redfin.__init__
+~~~~~~~~~~~~
+This module implements the scraper for redfin.com
+"""
 import json
 from typing import Any
 from .. import Scraper
-from ....utils import parse_address_two, parse_unit
+from ....utils import parse_address_two, parse_address_one
-from ..models import Property, Address, PropertyType
+from ..models import Property, Address, PropertyType, ListingType, SiteName
 from ....exceptions import NoResultsFound
@@ -12,9 +18,7 @@ class RedfinScraper(Scraper):
         self.listing_type = scraper_input.listing_type
 
     def _handle_location(self):
-        url = "https://www.redfin.com/stingray/do/location-autocomplete?v=2&al=1&location={}".format(
-            self.location
-        )
+        url = "https://www.redfin.com/stingray/do/location-autocomplete?v=2&al=1&location={}".format(self.location)
 
         response = self.session.get(url)
         response_json = json.loads(response.text.replace("{}&&", ""))
@@ -28,9 +32,7 @@ class RedfinScraper(Scraper):
             return "address"  #: address, needs to be handled differently
 
         if "exactMatch" not in response_json["payload"]:
-            raise NoResultsFound(
-                "No results found for location: {}".format(self.location)
-            )
+            raise NoResultsFound("No results found for location: {}".format(self.location))
 
         if response_json["payload"]["exactMatch"] is not None:
             target = response_json["payload"]["exactMatch"]
@@ -45,39 +47,30 @@ class RedfinScraper(Scraper):
             return home[key]["value"]
 
         if not single_search:
-            street_address, unit = parse_address_two(get_value("streetLine"))
-            unit = parse_unit(get_value("streetLine"))
             address = Address(
-                street_address=street_address,
-                city=home["city"],
-                state=home["state"],
-                zip_code=home["zip"],
-                unit=unit,
-                country="USA",
+                address_one=parse_address_one(get_value("streetLine"))[0],
+                address_two=parse_address_one(get_value("streetLine"))[1],
+                city=home.get("city"),
+                state=home.get("state"),
+                zip_code=home.get("zip"),
             )
         else:
-            address_info = home["streetAddress"]
-            street_address, unit = parse_address_two(address_info["assembledAddress"])
+            address_info = home.get("streetAddress")
+            address_one, address_two = parse_address_one(address_info.get("assembledAddress"))
 
             address = Address(
-                street_address=street_address,
-                city=home["city"],
-                state=home["state"],
-                zip_code=home["zip"],
-                unit=unit,
-                country="USA",
+                address_one=address_one,
+                address_two=address_two,
+                city=home.get("city"),
+                state=home.get("state"),
+                zip_code=home.get("zip"),
             )
 
         url = "https://www.redfin.com{}".format(home["url"])
 
-        #: property_type = home["propertyType"] if "propertyType" in home else None
         lot_size_data = home.get("lotSize")
 
         if not isinstance(lot_size_data, int):
-            lot_size = (
-                lot_size_data.get("value", None)
-                if isinstance(lot_size_data, dict)
-                else None
-            )
+            lot_size = lot_size_data.get("value", None) if isinstance(lot_size_data, dict) else None
         else:
             lot_size = lot_size_data
 
@@ -86,28 +79,82 @@ class RedfinScraper(Scraper):
             listing_type=self.listing_type,
             address=address,
             property_url=url,
-            beds=home["beds"] if "beds" in home else None,
-            baths=home["baths"] if "baths" in home else None,
+            beds_min=home["beds"] if "beds" in home else None,
+            beds_max=home["beds"] if "beds" in home else None,
+            baths_min=home["baths"] if "baths" in home else None,
+            baths_max=home["baths"] if "baths" in home else None,
+            price_min=get_value("price"),
+            price_max=get_value("price"),
+            sqft_min=get_value("sqFt"),
+            sqft_max=get_value("sqFt"),
             stories=home["stories"] if "stories" in home else None,
             agent_name=get_value("listingAgent"),
description=home["listingRemarks"] if "listingRemarks" in home else None, description=home["listingRemarks"] if "listingRemarks" in home else None,
year_built=get_value("yearBuilt") year_built=get_value("yearBuilt") if not single_search else home["yearBuilt"],
if not single_search
else home["yearBuilt"],
square_feet=get_value("sqFt"),
lot_area_value=lot_size, lot_area_value=lot_size,
property_type=PropertyType.from_int_code(home.get("propertyType")), property_type=PropertyType.from_int_code(home.get("propertyType")),
price_per_sqft=get_value("pricePerSqFt"), price_per_sqft=get_value("pricePerSqFt") if type(home.get("pricePerSqFt")) != int else home.get("pricePerSqFt"),
price=get_value("price"),
mls_id=get_value("mlsId"), mls_id=get_value("mlsId"),
latitude=home["latLong"]["latitude"] latitude=home["latLong"]["latitude"] if "latLong" in home and "latitude" in home["latLong"] else None,
if "latLong" in home and "latitude" in home["latLong"] longitude=home["latLong"]["longitude"] if "latLong" in home and "longitude" in home["latLong"] else None,
else None,
longitude=home["latLong"]["longitude"]
if "latLong" in home and "longitude" in home["latLong"]
else None,
) )
def _handle_rentals(self, region_id, region_type):
url = f"https://www.redfin.com/stingray/api/v1/search/rentals?al=1&isRentals=true&region_id={region_id}&region_type={region_type}&num_homes=100000"
response = self.session.get(url)
response.raise_for_status()
homes = response.json()
properties_list = []
for home in homes["homes"]:
home_data = home["homeData"]
rental_data = home["rentalExtension"]
property_url = f"https://www.redfin.com{home_data.get('url', '')}"
address_info = home_data.get("addressInfo", {})
centroid = address_info.get("centroid", {}).get("centroid", {})
address = Address(
address_one=parse_address_one(address_info.get("formattedStreetLine"))[0],
city=address_info.get("city"),
state=address_info.get("state"),
zip_code=address_info.get("zip"),
)
price_range = rental_data.get("rentPriceRange", {"min": None, "max": None})
bed_range = rental_data.get("bedRange", {"min": None, "max": None})
bath_range = rental_data.get("bathRange", {"min": None, "max": None})
sqft_range = rental_data.get("sqftRange", {"min": None, "max": None})
property_ = Property(
property_url=property_url,
site_name=SiteName.REDFIN,
listing_type=ListingType.FOR_RENT,
address=address,
description=rental_data.get("description"),
latitude=centroid.get("latitude"),
longitude=centroid.get("longitude"),
baths_min=bath_range.get("min"),
baths_max=bath_range.get("max"),
beds_min=bed_range.get("min"),
beds_max=bed_range.get("max"),
price_min=price_range.get("min"),
price_max=price_range.get("max"),
sqft_min=sqft_range.get("min"),
sqft_max=sqft_range.get("max"),
img_src=home_data.get("staticMapUrl"),
posted_time=rental_data.get("lastUpdated"),
bldg_name=rental_data.get("propertyName"),
)
properties_list.append(property_)
if not properties_list:
raise NoResultsFound("No rentals found for the given location.")
return properties_list
def _parse_building(self, building: dict) -> Property: def _parse_building(self, building: dict) -> Property:
street_address = " ".join( street_address = " ".join(
[ [
@@ -117,16 +164,15 @@ class RedfinScraper(Scraper):
building["address"]["streetType"], building["address"]["streetType"],
] ]
) )
street_address, unit = parse_address_two(street_address)
return Property( return Property(
site_name=self.site_name, site_name=self.site_name,
property_type=PropertyType("BUILDING"), property_type=PropertyType("BUILDING"),
address=Address( address=Address(
street_address=street_address, address_one=parse_address_one(street_address)[0],
city=building["address"]["city"], city=building["address"]["city"],
state=building["address"]["stateOrProvinceCode"], state=building["address"]["stateOrProvinceCode"],
zip_code=building["address"]["postalCode"], zip_code=building["address"]["postalCode"],
unit=parse_unit( address_two=parse_address_two(
" ".join( " ".join(
[ [
building["address"]["unitType"], building["address"]["unitType"],
@@ -137,7 +183,7 @@ class RedfinScraper(Scraper):
), ),
property_url="https://www.redfin.com{}".format(building["url"]), property_url="https://www.redfin.com{}".format(building["url"]),
listing_type=self.listing_type, listing_type=self.listing_type,
bldg_unit_count=building["numUnitsForSale"], unit_count=building.get("numUnitsForSale"),
) )
def handle_address(self, home_id: str): def handle_address(self, home_id: str):
@@ -148,7 +194,6 @@ class RedfinScraper(Scraper):
https://www.redfin.com/stingray/api/home/details/aboveTheFold?propertyId=147337694&accessLevel=3 https://www.redfin.com/stingray/api/home/details/aboveTheFold?propertyId=147337694&accessLevel=3
https://www.redfin.com/stingray/api/home/details/belowTheFold?propertyId=147337694&accessLevel=3 https://www.redfin.com/stingray/api/home/details/belowTheFold?propertyId=147337694&accessLevel=3
""" """
url = "https://www.redfin.com/stingray/api/home/details/aboveTheFold?propertyId={}&accessLevel=3".format( url = "https://www.redfin.com/stingray/api/home/details/aboveTheFold?propertyId={}&accessLevel=3".format(
home_id home_id
) )
@@ -156,9 +201,7 @@ class RedfinScraper(Scraper):
response = self.session.get(url) response = self.session.get(url)
response_json = json.loads(response.text.replace("{}&&", "")) response_json = json.loads(response.text.replace("{}&&", ""))
parsed_home = self._parse_home( parsed_home = self._parse_home(response_json["payload"]["addressSectionInfo"], single_search=True)
response_json["payload"]["addressSectionInfo"], single_search=True
)
return [parsed_home] return [parsed_home]
def search(self): def search(self):
@@ -168,18 +211,23 @@ class RedfinScraper(Scraper):
home_id = region_id home_id = region_id
return self.handle_address(home_id) return self.handle_address(home_id)
url = "https://www.redfin.com/stingray/api/gis?al=1&region_id={}&region_type={}".format( if self.listing_type == ListingType.FOR_RENT:
region_id, region_type return self._handle_rentals(region_id, region_type)
) else:
if self.listing_type == ListingType.FOR_SALE:
url = f"https://www.redfin.com/stingray/api/gis?al=1&region_id={region_id}&region_type={region_type}&num_homes=100000"
else:
url = f"https://www.redfin.com/stingray/api/gis?al=1&region_id={region_id}&region_type={region_type}&sold_within_days=30&num_homes=100000"
response = self.session.get(url)
response_json = json.loads(response.text.replace("{}&&", ""))
response = self.session.get(url) if "payload" in response_json:
response_json = json.loads(response.text.replace("{}&&", "")) homes_list = response_json["payload"].get("homes", [])
buildings_list = response_json["payload"].get("buildings", {}).values()
homes = [ homes = [self._parse_home(home) for home in homes_list] + [
self._parse_home(home) for home in response_json["payload"]["homes"] self._parse_building(building) for building in buildings_list
] + [ ]
self._parse_building(building) return homes
for building in response_json["payload"]["buildings"].values() else:
] return []
return homes


@@ -1,8 +1,13 @@
+"""
+homeharvest.zillow.__init__
+~~~~~~~~~~~~
+
+This module implements the scraper for zillow.com
+"""
 import re
 import json
+import string
 from .. import Scraper
-from ....utils import parse_address_two, parse_unit
+from ....utils import parse_address_one, parse_address_two
 from ....exceptions import GeoCoordsNotFound, NoResultsFound
 from ..models import Property, Address, ListingType, PropertyType
@@ -10,27 +15,27 @@ from ..models import Property, Address, ListingType, PropertyType
 class ZillowScraper(Scraper):
     def __init__(self, scraper_input):
         super().__init__(scraper_input)
+        self.listing_type = scraper_input.listing_type

         if not self.is_plausible_location(self.location):
             raise NoResultsFound("Invalid location input: {}".format(self.location))

-        if self.listing_type == ListingType.FOR_SALE:
-            self.url = f"https://www.zillow.com/homes/for_sale/{self.location}_rb/"
-        elif self.listing_type == ListingType.FOR_RENT:
-            self.url = f"https://www.zillow.com/homes/for_rent/{self.location}_rb/"
-        else:
-            self.url = f"https://www.zillow.com/homes/recently_sold/{self.location}_rb/"
+        listing_type_to_url_path = {
+            ListingType.FOR_SALE: "for_sale",
+            ListingType.FOR_RENT: "for_rent",
+            ListingType.SOLD: "recently_sold",
+        }
+
+        self.url = f"https://www.zillow.com/homes/{listing_type_to_url_path[self.listing_type]}/{self.location}_rb/"

-    @staticmethod
-    def is_plausible_location(location: str) -> bool:
-        blocks = location.split()
-        for block in blocks:
-            if (
-                any(char.isdigit() for char in block)
-                and any(char.isalpha() for char in block)
-                and len(block) > 6
-            ):
-                return False
-        return True
+    def is_plausible_location(self, location: str) -> bool:
+        url = (
+            "https://www.zillowstatic.com/autocomplete/v3/suggestions?q={"
+            "}&abKey=6666272a-4b99-474c-b857-110ec438732b&clientId=homepage-render"
+        ).format(location)
+
+        response = self.session.get(url)
+
+        return response.json()["results"] != []

     def search(self):
         resp = self.session.get(self.url, headers=self._get_headers())
@@ -43,9 +48,7 @@ class ZillowScraper(Scraper):
             re.DOTALL,
         )
         if not match:
-            raise NoResultsFound(
-                "No results were found for Zillow with the given Location."
-            )
+            raise NoResultsFound("No results were found for Zillow with the given Location.")

         json_str = match.group(1)
         data = json.loads(json_str)
@@ -144,85 +147,70 @@ class ZillowScraper(Scraper):
             if "hdpData" in result:
                 home_info = result["hdpData"]["homeInfo"]
                 address_data = {
-                    "street_address": parse_address_two(home_info["streetAddress"])[0],
-                    "unit": parse_unit(home_info["unit"])
-                    if "unit" in home_info
-                    else None,
-                    "city": home_info["city"],
-                    "state": home_info["state"],
-                    "zip_code": home_info["zipcode"],
-                    "country": home_info["country"],
+                    "address_one": parse_address_one(home_info.get("streetAddress"))[0],
+                    "address_two": parse_address_two(home_info["unit"]) if "unit" in home_info else "#",
+                    "city": home_info.get("city"),
+                    "state": home_info.get("state"),
+                    "zip_code": home_info.get("zipcode"),
                 }
-                property_data = {
-                    "site_name": self.site_name,
-                    "address": Address(**address_data),
-                    "property_url": f"https://www.zillow.com{result['detailUrl']}",
-                    "beds": int(home_info["bedrooms"])
-                    if "bedrooms" in home_info
-                    else None,
-                    "baths": home_info.get("bathrooms"),
-                    "square_feet": int(home_info["livingArea"])
-                    if "livingArea" in home_info
-                    else None,
-                    "currency": home_info["currency"],
-                    "price": home_info.get("price"),
-                    "tax_assessed_value": int(home_info["taxAssessedValue"])
-                    if "taxAssessedValue" in home_info
-                    else None,
-                    "property_type": PropertyType(home_info["homeType"]),
-                    "listing_type": ListingType(
-                        home_info["statusType"]
-                        if "statusType" in home_info
-                        else self.listing_type
-                    ),
-                    "lot_area_value": round(home_info["lotAreaValue"], 2)
-                    if "lotAreaValue" in home_info
-                    else None,
-                    "lot_area_unit": home_info.get("lotAreaUnit"),
-                    "latitude": result["latLong"]["latitude"],
-                    "longitude": result["latLong"]["longitude"],
-                    "status_text": result.get("statusText"),
-                    "posted_time": result["variableData"]["text"]
-                    if "variableData" in result
-                    and "text" in result["variableData"]
-                    and result["variableData"]["type"] == "TIME_ON_INFO"
-                    else None,
-                    "img_src": result.get("imgSrc"),
-                    "price_per_sqft": int(home_info["price"] // home_info["livingArea"])
-                    if "livingArea" in home_info and "price" in home_info
-                    else None,
-                }
-                property_obj = Property(**property_data)
+                property_obj = Property(
+                    site_name=self.site_name,
+                    address=Address(**address_data),
+                    property_url=f"https://www.zillow.com{result['detailUrl']}",
+                    tax_assessed_value=int(home_info["taxAssessedValue"]) if "taxAssessedValue" in home_info else None,
+                    property_type=PropertyType(home_info.get("homeType")),
+                    listing_type=ListingType(
+                        home_info["statusType"] if "statusType" in home_info else self.listing_type
+                    ),
+                    status_text=result.get("statusText"),
+                    posted_time=result["variableData"]["text"]
+                    if "variableData" in result
+                    and "text" in result["variableData"]
+                    and result["variableData"]["type"] == "TIME_ON_INFO"
+                    else None,
+                    price_min=home_info.get("price"),
+                    price_max=home_info.get("price"),
+                    beds_min=int(home_info["bedrooms"]) if "bedrooms" in home_info else None,
+                    beds_max=int(home_info["bedrooms"]) if "bedrooms" in home_info else None,
+                    baths_min=home_info.get("bathrooms"),
+                    baths_max=home_info.get("bathrooms"),
+                    sqft_min=int(home_info["livingArea"]) if "livingArea" in home_info else None,
+                    sqft_max=int(home_info["livingArea"]) if "livingArea" in home_info else None,
+                    price_per_sqft=int(home_info["price"] // home_info["livingArea"])
+                    if "livingArea" in home_info and home_info["livingArea"] != 0 and "price" in home_info
+                    else None,
+                    latitude=result["latLong"]["latitude"],
+                    longitude=result["latLong"]["longitude"],
+                    lot_area_value=round(home_info["lotAreaValue"], 2) if "lotAreaValue" in home_info else None,
+                    lot_area_unit=home_info.get("lotAreaUnit"),
+                    img_src=result.get("imgSrc"),
+                )

                 properties_list.append(property_obj)

             elif "isBuilding" in result:
-                price = result["price"]
-                building_data = {
-                    "property_url": f"https://www.zillow.com{result['detailUrl']}",
-                    "site_name": self.site_name,
-                    "property_type": PropertyType("BUILDING"),
-                    "listing_type": ListingType(result["statusType"]),
-                    "img_src": result["imgSrc"],
-                    "price": int(price.replace("From $", "").replace(",", ""))
-                    if "From $" in price
-                    else None,
-                    "apt_min_price": int(
-                        price.replace("$", "").replace(",", "").replace("+/mo", "")
-                    )
-                    if "+/mo" in price
-                    else None,
-                    "address": self._extract_address(result["address"]),
-                    "bldg_min_beds": result["minBeds"],
-                    "currency": "USD",
-                    "bldg_min_baths": result["minBaths"],
-                    "bldg_min_area": result.get("minArea"),
-                    "bldg_unit_count": result["unitCount"],
-                    "bldg_name": result.get("communityName"),
-                    "status_text": result["statusText"],
-                    "latitude": result["latLong"]["latitude"],
-                    "longitude": result["latLong"]["longitude"],
-                }
-                building_obj = Property(**building_data)
+                price_string = result["price"].replace("$", "").replace(",", "").replace("+/mo", "")
+
+                match = re.search(r"(\d+)", price_string)
+                price_value = int(match.group(1)) if match else None
+                building_obj = Property(
+                    property_url=f"https://www.zillow.com{result['detailUrl']}",
+                    site_name=self.site_name,
+                    property_type=PropertyType("BUILDING"),
+                    listing_type=ListingType(result["statusType"]),
+                    img_src=result.get("imgSrc"),
+                    address=self._extract_address(result["address"]),
+                    baths_min=result.get("minBaths"),
+                    area_min=result.get("minArea"),
+                    bldg_name=result.get("communityName"),
+                    status_text=result.get("statusText"),
+                    price_min=price_value if "+/mo" in result.get("price") else None,
+                    price_max=price_value if "+/mo" in result.get("price") else None,
+                    latitude=result.get("latLong", {}).get("latitude"),
+                    longitude=result.get("latLong", {}).get("longitude"),
+                    unit_count=result.get("unitCount"),
+                )

                 properties_list.append(building_obj)

         return properties_list
@@ -237,43 +225,41 @@ class ZillowScraper(Scraper):
             else property_data["hdpUrl"]
         )
         address_data = property_data["address"]
-        street_address, unit = parse_address_two(address_data["streetAddress"])
+        address_one, address_two = parse_address_one(address_data["streetAddress"])
         address = Address(
-            street_address=street_address,
-            unit=unit,
+            address_one=address_one,
+            address_two=address_two if address_two else "#",
             city=address_data["city"],
             state=address_data["state"],
             zip_code=address_data["zipcode"],
-            country=property_data.get("country"),
         )

         property_type = property_data.get("homeType", None)

         return Property(
             site_name=self.site_name,
-            address=address,
             property_url=url,
-            beds=property_data.get("bedrooms", None),
-            baths=property_data.get("bathrooms", None),
-            year_built=property_data.get("yearBuilt", None),
-            price=property_data.get("price", None),
-            tax_assessed_value=property_data.get("taxAssessedValue", None),
+            property_type=PropertyType(property_type),
+            listing_type=self.listing_type,
+            address=address,
+            year_built=property_data.get("yearBuilt"),
+            tax_assessed_value=property_data.get("taxAssessedValue"),
+            lot_area_value=property_data.get("lotAreaValue"),
+            lot_area_unit=property_data["lotAreaUnits"].lower() if "lotAreaUnits" in property_data else None,
+            agent_name=property_data.get("attributionInfo", {}).get("agentName"),
+            stories=property_data.get("resoFacts", {}).get("stories"),
+            mls_id=property_data.get("attributionInfo", {}).get("mlsId"),
+            beds_min=property_data.get("bedrooms"),
+            beds_max=property_data.get("bedrooms"),
+            baths_min=property_data.get("bathrooms"),
+            baths_max=property_data.get("bathrooms"),
+            price_min=property_data.get("price"),
+            price_max=property_data.get("price"),
+            sqft_min=property_data.get("livingArea"),
+            sqft_max=property_data.get("livingArea"),
+            price_per_sqft=property_data.get("resoFacts", {}).get("pricePerSquareFoot"),
             latitude=property_data.get("latitude"),
             longitude=property_data.get("longitude"),
             img_src=property_data.get("streetViewTileImageUrlMediumAddress"),
-            currency=property_data.get("currency", None),
-            lot_area_value=property_data.get("lotAreaValue"),
-            lot_area_unit=property_data["lotAreaUnits"].lower()
-            if "lotAreaUnits" in property_data
-            else None,
-            agent_name=property_data.get("attributionInfo", {}).get("agentName", None),
-            stories=property_data.get("resoFacts", {}).get("stories", None),
-            description=property_data.get("description", None),
-            mls_id=property_data.get("attributionInfo", {}).get("mlsId", None),
-            price_per_sqft=property_data.get("resoFacts", {}).get(
-                "pricePerSquareFoot", None
-            ),
-            square_feet=property_data.get("livingArea", None),
-            property_type=PropertyType(property_type),
-            listing_type=self.listing_type,
+            description=property_data.get("description"),
         )

     def _extract_address(self, address_str):
@@ -286,7 +272,7 @@ class ZillowScraper(Scraper):
         if len(parts) != 3:
             raise ValueError(f"Unexpected address format: {address_str}")

-        street_address = parts[0].strip()
+        address_one = parts[0].strip()
         city = parts[1].strip()
         state_zip = parts[2].split(" ")
@@ -299,14 +285,13 @@ class ZillowScraper(Scraper):
         else:
             raise ValueError(f"Unexpected state/zip format in address: {address_str}")

-        street_address, unit = parse_address_two(street_address)
+        address_one, address_two = parse_address_one(address_one)

         return Address(
-            street_address=street_address,
+            address_one=address_one,
+            address_two=address_two if address_two else "#",
             city=city,
-            unit=unit,
             state=state,
             zip_code=zip_code,
-            country="USA",
         )

     @staticmethod
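The building branch above normalizes a Zillow price string (e.g. "$1,395+/mo") by stripping "$", ",", and "+/mo", then extracting the first run of digits. A standalone sketch of that step (the helper name `parse_building_price` is illustrative, not from the codebase):

```python
import re

def parse_building_price(price: str):
    # Strip currency symbol, thousands separators, and the "+/mo" suffix,
    # then pull the first integer out of whatever remains.
    price_string = price.replace("$", "").replace(",", "").replace("+/mo", "")
    match = re.search(r"(\d+)", price_string)
    return int(match.group(1)) if match else None

print(parse_building_price("$1,395+/mo"))    # -> 1395
print(parse_building_price("From $510,000"))  # -> 510000
```

Guarding with `if match else None` matters here: listing cards occasionally carry non-numeric price text, which would otherwise raise on `int()`.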


@@ -1,9 +1,9 @@
 import re

-def parse_address_two(street_address: str) -> tuple:
+def parse_address_one(street_address: str) -> tuple:
     if not street_address:
-        return street_address, None
+        return street_address, "#"

     apt_match = re.search(
         r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+|SUITE\s*[\dA-Z]+)$",
@@ -13,36 +13,26 @@ def parse_address_two(street_address: str) -> tuple:
     if apt_match:
         apt_str = apt_match.group().strip()
-        cleaned_apt_str = re.sub(
-            r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", "#", apt_str, flags=re.I
-        )
+        cleaned_apt_str = re.sub(r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", "#", apt_str, flags=re.I)

         main_address = street_address.replace(apt_str, "").strip()
         return main_address, cleaned_apt_str
     else:
-        return street_address, None
+        return street_address, "#"

-def parse_unit(street_address: str):
+def parse_address_two(street_address: str):
     if not street_address:
-        return None
+        return "#"

     apt_match = re.search(
-        r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+)$",
+        r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+|SUITE\s*[\dA-Z]+)$",
         street_address,
         re.I,
     )

     if apt_match:
         apt_str = apt_match.group().strip()
-        apt_str = re.sub(r"(APT\s*|UNIT\s*|LOT\s*)", "#", apt_str, flags=re.I)
+        apt_str = re.sub(r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", "#", apt_str, flags=re.I)
         return apt_str
     else:
-        return None
+        return "#"

-if __name__ == "__main__":
-    print(parse_address_two("4303 E Cactus Rd Apt 126"))
-    print(parse_address_two("1234 Elm Street apt 2B"))
-    print(parse_address_two("1234 Elm Street UNIT 3A"))
-    print(parse_address_two("1234 Elm Street unit 3A"))
-    print(parse_address_two("1234 Elm Street SuIte 3A"))
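The renamed helper can be exercised standalone. This sketch reproduces the new `parse_address_two` from the diff above (which now matches `SUITE` and falls back to `"#"` instead of `None`):

```python
import re

def parse_address_two(street_address: str) -> str:
    # Normalize a bare unit string ("Apt 126", "SUITE 3A", ...) to "#<unit>";
    # anything unrecognized or empty falls back to "#".
    if not street_address:
        return "#"
    apt_match = re.search(
        r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+|SUITE\s*[\dA-Z]+)$",
        street_address,
        re.I,
    )
    if apt_match:
        apt_str = apt_match.group().strip()
        return re.sub(r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", "#", apt_str, flags=re.I)
    return "#"

print(parse_address_two("SuIte 3A"))  # -> #3A
print(parse_address_two("Main St"))   # -> #
```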

poetry.lock

@@ -106,6 +106,17 @@
     {file = "colorama-0.4.6.tar.gz", hash = "sha256:08695f5cb7ed6e0531a20572697297273c47b8cae5a63ffc6d6ed5c201be6e44"},
 ]

+[[package]]
+name = "et-xmlfile"
+version = "1.1.0"
+description = "An implementation of lxml.xmlfile for the standard library"
+optional = false
+python-versions = ">=3.6"
+files = [
+    {file = "et_xmlfile-1.1.0-py3-none-any.whl", hash = "sha256:a2ba85d1d6a74ef63837eed693bcb89c3f752169b0e3e7ae5b16ca5e1b3deada"},
+    {file = "et_xmlfile-1.1.0.tar.gz", hash = "sha256:8eb9e2bc2f8c97e37a2dc85a09ecdcdec9d8a396530a6d5a33b30b9a92da0c5c"},
+]
+
 [[package]]
 name = "exceptiongroup"
 version = "1.1.3"
@@ -217,6 +228,20 @@
     {file = "numpy-1.26.0.tar.gz", hash = "sha256:f93fc78fe8bf15afe2b8d6b6499f1c73953169fad1e9a8dd086cdff3190e7fdf"},
 ]

+[[package]]
+name = "openpyxl"
+version = "3.1.2"
+description = "A Python library to read/write Excel 2010 xlsx/xlsm files"
+optional = false
+python-versions = ">=3.6"
+files = [
+    {file = "openpyxl-3.1.2-py2.py3-none-any.whl", hash = "sha256:f91456ead12ab3c6c2e9491cf33ba6d08357d802192379bb482f1033ade496f5"},
+    {file = "openpyxl-3.1.2.tar.gz", hash = "sha256:a6f5977418eff3b2d5500d54d9db50c8277a368436f4e4f8ddb1be3422870184"},
+]
+
+[package.dependencies]
+et-xmlfile = "*"
+
 [[package]]
 name = "packaging"
 version = "23.1"
@@ -425,4 +450,4 @@
 [metadata]
 lock-version = "2.0"
 python-versions = "^3.10"
-content-hash = "eede625d6d45085e143b0af246cb2ce00cff8579c667be3b63387c8594a5570d"
+content-hash = "3647d568f5623dd762f19029230626a62e68309fa2ef8be49a36382c19264a5f"


@@ -1,15 +1,19 @@
 [tool.poetry]
 name = "homeharvest"
-version = "0.2.1"
+version = "0.2.12"
 description = "Real estate scraping library supporting Zillow, Realtor.com & Redfin."
 authors = ["Zachary Hampton <zachary@zacharysproducts.com>", "Cullen Watson <cullen@cullen.ai>"]
 homepage = "https://github.com/ZacharyHampton/HomeHarvest"
 readme = "README.md"

+[tool.poetry.scripts]
+homeharvest = "homeharvest.cli:main"
+
 [tool.poetry.dependencies]
 python = "^3.10"
 requests = "^2.31.0"
 pandas = "^2.1.0"
+openpyxl = "^3.1.2"

 [tool.poetry.group.dev.dependencies]

View File

@@ -9,15 +9,9 @@ from homeharvest.exceptions import (
 def test_redfin():
     results = [
-        scrape_property(
-            location="2530 Al Lipscomb Way", site_name="redfin", listing_type="for_sale"
-        ),
-        scrape_property(
-            location="Phoenix, AZ, USA", site_name=["redfin"], listing_type="for_rent"
-        ),
-        scrape_property(
-            location="Dallas, TX, USA", site_name="redfin", listing_type="sold"
-        ),
+        scrape_property(location="2530 Al Lipscomb Way", site_name="redfin", listing_type="for_sale"),
+        scrape_property(location="Phoenix, AZ, USA", site_name=["redfin"], listing_type="for_rent"),
+        scrape_property(location="Dallas, TX, USA", site_name="redfin", listing_type="sold"),
         scrape_property(location="85281", site_name="redfin"),
     ]

tests/test_utils.py (new file)

@@ -0,0 +1,24 @@
+from homeharvest.utils import parse_address_one, parse_address_two
+
+
+def test_parse_address_one():
+    test_data = [
+        ("4303 E Cactus Rd Apt 126", ("4303 E Cactus Rd", "#126")),
+        ("1234 Elm Street apt 2B", ("1234 Elm Street", "#2B")),
+        ("1234 Elm Street UNIT 3A", ("1234 Elm Street", "#3A")),
+        ("1234 Elm Street unit 3A", ("1234 Elm Street", "#3A")),
+        ("1234 Elm Street SuIte 3A", ("1234 Elm Street", "#3A")),
+    ]
+
+    for input_data, (exp_addr_one, exp_addr_two) in test_data:
+        address_one, address_two = parse_address_one(input_data)
+        assert address_one == exp_addr_one
+        assert address_two == exp_addr_two
+
+
+def test_parse_address_two():
+    test_data = [("Apt 126", "#126"), ("apt 2B", "#2B"), ("UNIT 3A", "#3A"), ("unit 3A", "#3A"), ("SuIte 3A", "#3A")]
+
+    for input_data, expected in test_data:
+        output = parse_address_two(input_data)
+        assert output == expected
View File

@@ -9,15 +9,9 @@ from homeharvest.exceptions import (
 def test_zillow():
     results = [
-        scrape_property(
-            location="2530 Al Lipscomb Way", site_name="zillow", listing_type="for_sale"
-        ),
-        scrape_property(
-            location="Phoenix, AZ, USA", site_name=["zillow"], listing_type="for_rent"
-        ),
-        scrape_property(
-            location="Dallas, TX, USA", site_name="zillow", listing_type="sold"
-        ),
+        scrape_property(location="2530 Al Lipscomb Way", site_name="zillow", listing_type="for_sale"),
+        scrape_property(location="Phoenix, AZ, USA", site_name=["zillow"], listing_type="for_rent"),
+        scrape_property(location="Dallas, TX, USA", site_name="zillow", listing_type="sold"),
         scrape_property(location="85281", site_name="zillow"),
     ]