Compare commits

...

33 Commits

Author SHA1 Message Date
Zachary Hampton
59317fd6fc Merge pull request #25 from ZacharyHampton/fix/recent-issues
Fix/recent issues
2023-09-28 18:27:04 -07:00
Zachary Hampton
928b431d1f - bump version 2023-09-28 18:25:53 -07:00
Zachary Hampton
896f862137 - zillow flow update 2023-09-28 18:25:47 -07:00
Zachary Hampton
3174f5076c Merge pull request #23 from ZacharyHampton/fix/recent-issues
Fixes & Changes for recent issues
2023-09-28 18:07:55 -07:00
Zachary Hampton
2abbb913a8 - convert posted_time to datetime
- zillow location bug fix
2023-09-28 18:07:42 -07:00
Cullen Watson
73b6d5b33f [fix] zilow tls client 2023-09-28 19:34:01 -05:00
Zachary Hampton
da39c989d9 - version bump 2023-09-28 15:27:36 -07:00
Zachary Hampton
01c53f9399 - redfin bug fix
- add recent features for issues
2023-09-28 15:19:43 -07:00
Zachary Hampton
9200c17df2 - version bump 2023-09-23 10:55:50 -07:00
Zachary Hampton
9e262bf214 Merge remote-tracking branch 'origin/master' 2023-09-23 10:55:29 -07:00
Zachary Hampton
82f78fb578 - zillow bug fix 2023-09-23 10:55:14 -07:00
Cullen Watson
b0e40df00a Update pyproject.toml 2023-09-22 09:51:24 -05:00
Cullen Watson
2fc40e0dad fix: cookie 2023-09-22 09:47:37 -05:00
Zachary Hampton
254f3a68a1 - redfin bug fix 2023-09-21 18:54:03 -07:00
Zachary Hampton
05713c76b0 - redfin bug fix
- .get
2023-09-21 11:27:12 -07:00
Cullen Watson
9120cc9bfe fix: remove line 2023-09-21 13:10:14 -05:00
Cullen Watson
eee4b19515 Merge branch 'master' of https://github.com/ZacharyHampton/HomeHarvest 2023-09-21 13:06:15 -05:00
Cullen Watson
c25961eded fix: KeyEror : [minBaths] 2023-09-21 13:06:06 -05:00
Zachary Hampton
0884c3d163 Update README.md 2023-09-21 09:55:29 -07:00
Cullen Watson
8f37bfdeb8 chore: version number 2023-09-21 11:19:23 -05:00
Cullen Watson
48c2338276 fix: keyerror 2023-09-21 11:18:37 -05:00
Cullen Watson
f58a1f4a74 docs: tryhomeharvest.com 2023-09-21 10:57:11 -05:00
Zachary Hampton
4cef926d7d Merge pull request #14 from ZacharyHampton/keep_duplicates_flag
Keep duplicates flag
2023-09-20 20:27:08 -07:00
Cullen Watson
e82eeaa59f docs: add keep duplicates flag 2023-09-20 20:25:50 -05:00
Cullen Watson
644f16b25b feat: keep duplicates flag 2023-09-20 20:24:18 -05:00
Cullen Watson
e9ddc6df92 docs: update tutorial vid for release v0.2.7 2023-09-19 22:18:49 -05:00
Cullen Watson
50fb1c391d docs: update property schema 2023-09-19 21:35:37 -05:00
Cullen Watson
4f91f9dadb chore: version number 2023-09-19 21:17:12 -05:00
Zachary Hampton
66e55173b1 Merge pull request #13 from ZacharyHampton/simplify_fields
fix: simplify fields
2023-09-19 19:16:18 -07:00
Cullen Watson
f6054e8746 fix: simplify fields 2023-09-19 21:13:20 -05:00
Cullen Watson
e8d9235ee6 chore: update version number 2023-09-19 16:43:59 -05:00
Cullen Watson
043f091158 fix: keyerror on address 2023-09-19 16:43:17 -05:00
Cullen Watson
eae8108978 docs: change cmd 2023-09-19 16:18:01 -05:00
15 changed files with 457 additions and 410 deletions

View File

@@ -4,20 +4,26 @@
[![Try with Replit](https://replit.com/badge?caption=Try%20with%20Replit)](https://replit.com/@ZacharyHampton/HomeHarvestDemo)
\
**Not technical?** Try out the web scraping tool on our site at [tryhomeharvest.com](https://tryhomeharvest.com).
*Looking to build a data-focused software product?* **[Book a call](https://calendly.com/zachary-products/15min)** *to work with us.*
Check out another project we wrote: ***[JobSpy](https://github.com/cullenwatson/JobSpy)**, a Python package for job scraping*
## Features
- Scrapes properties from **Zillow**, **Realtor.com** & **Redfin** simultaneously
- Aggregates the properties in a Pandas DataFrame
[Video Guide for HomeHarvest](https://www.youtube.com/watch?v=HCoHoiJdWQY)
[Video Guide for HomeHarvest](https://youtu.be/JnV7eR2Ve2o) - _updated for release v0.2.7_
![homeharvest](https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/b3d5d727-e67b-4a9f-85d8-1e65fd18620a)
## Installation
```bash
pip install --force-reinstall homeharvest
pip install homeharvest
```
_Python version >= [3.10](https://www.python.org/downloads/release/python-3100/) required_
@@ -37,6 +43,7 @@ By default:
- The `-o` or `--output` default format is `excel`. Options are `csv` or `excel`.
- If `-f` or `--filename` is left blank, the default is `HomeHarvest_<current_timestamp>`.
- If `-p` or `--proxy` is not provided, the scraper uses the local IP.
- Use `-k` or `--keep_duplicates` to keep duplicate properties based on address. If not provided, duplicates will be removed.
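
As a rough illustration of the `-k` / `--keep_duplicates` behaviour described above, the sketch below wires up an `argparse` flag and a plain-Python de-duplication step. The `build_parser` and `dedupe` helpers and the sample rows are hypothetical; `dedupe` stands in for the package's pandas-based step, and the address key columns (`address_one`, `address_two`, `city`) are taken from the `scrape_property` hunk in this diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Mirrors only the flag discussed above; the real CLI defines more options.
    parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
    parser.add_argument("location", type=str)
    parser.add_argument(
        "-k",
        "--keep_duplicates",
        action="store_true",
        help="Keep duplicate properties based on address",
    )
    return parser


def dedupe(rows: list, keep_duplicates: bool = False) -> list:
    """Hypothetical stand-in for the DataFrame de-duplication step."""
    if keep_duplicates:
        return list(rows)
    seen = set()
    unique = []
    for row in rows:
        key = (row.get("address_one"), row.get("address_two"), row.get("city"))
        if key not in seen:  # keep only the first occurrence of each address
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"address_one": "1 Main St", "address_two": "#", "city": "Dallas", "site_name": "zillow"},
    {"address_one": "1 Main St", "address_two": "#", "city": "Dallas", "site_name": "redfin"},
]

args = build_parser().parse_args(["Dallas, TX"])
print(len(dedupe(rows, args.keep_duplicates)))  # flag absent: duplicates removed

args = build_parser().parse_args(["Dallas, TX", "-k"])
print(len(dedupe(rows, args.keep_duplicates)))  # flag present: duplicates kept
```

Because `store_true` defaults to `False`, leaving the flag off gives the de-duplicating behaviour, which matches the `keep_duplicates: bool = False` default in the Python signature.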
### Python
```py
@@ -71,8 +78,9 @@ Required
├── location (str): address in various formats e.g. just zip, full address, city/state, etc.
└── listing_type (enum): for_rent, for_sale, sold
Optional
├── site_name (List[enum], default=all three sites): zillow, realtor.com, redfin
├── site_name (list[enum], default=all three sites): zillow, realtor.com, redfin
├── proxy (str): in format 'http://user:pass@host:port' or [https, socks]
└── keep_duplicates (bool, default=False): whether to keep or remove duplicate properties based on address
```
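
The `proxy` parameter above takes a single URL, and the `Scraper` hunk later in this diff applies it to both the `http` and `https` schemes. A minimal sketch of that mapping follows; `build_proxies` is a hypothetical helper written for illustration, and the validation step is an added assumption rather than something the package is known to do.

```python
from typing import Optional
from urllib.parse import urlsplit


def build_proxies(proxy_url: Optional[str]) -> dict:
    """Hypothetical helper: map one proxy URL onto both schemes."""
    if not proxy_url:
        return {}  # no proxy given, so requests go out from the local IP
    parts = urlsplit(proxy_url)
    # Assumed sanity check (not taken from the package): require scheme and host.
    if not parts.scheme or not parts.hostname:
        raise ValueError(
            f"proxy must look like 'http://user:pass@host:port', got {proxy_url!r}"
        )
    return {"http": proxy_url, "https": proxy_url}


print(build_proxies("http://user:pass@host:8080"))
print(build_proxies(None))
```

The returned mapping has the shape that `requests.Session().proxies.update(...)` accepts, which is how the diff applies it.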
### Property Schema
@@ -81,7 +89,7 @@ Property
├── Basic Information:
│ ├── property_url (str)
│ ├── site_name (enum): zillow, redfin, realtor.com
│ ├── listing_type (enum: ListingType)
│ ├── listing_type (enum): for_sale, for_rent, sold
│ └── property_type (enum): house, apartment, condo, townhouse, single_family, multi_family, building
├── Address Details:
@@ -92,45 +100,38 @@ Property
│ ├── unit (str)
│ └── country (str)
├── Property Features:
│ ├── price (int)
├── House for Sale Features:
│ ├── tax_assessed_value (int)
│ ├── currency (str)
│ ├── square_feet (int)
│ ├── beds (int)
│ ├── baths (float)
│ ├── lot_area_value (float)
│ ├── lot_area_unit (str)
│ ├── stories (int)
│ ├── year_built (int)
│ ├── year_built (int)
│ └── price_per_sqft (int)
├── Building for Sale and Apartment Details:
│ ├── bldg_name (str)
│ ├── beds_min (int)
│ ├── beds_max (int)
│ ├── baths_min (float)
│ ├── baths_max (float)
│ ├── sqft_min (int)
│ ├── sqft_max (int)
│ ├── price_min (int)
│ ├── price_max (int)
│ ├── area_min (int)
│ └── unit_count (int)
├── Miscellaneous Details:
│ ├── price_per_sqft (int)
│ ├── mls_id (str)
│ ├── agent_name (str)
│ ├── img_src (str)
│ ├── description (str)
│ ├── status_text (str)
│ ├── latitude (float)
│ ├── longitude (float)
│ └── posted_time (str) [Only for Zillow]
│ └── posted_time (str)
├── Building Details (for property_type: building):
│ ├── bldg_name (str)
│ ├── bldg_unit_count (int)
│ ├── bldg_min_beds (int)
│ ├── bldg_min_baths (float)
│ └── bldg_min_area (int)
└── Apartment Details (for property type: apartment):
├── apt_min_beds: int
├── apt_max_beds: int
├── apt_min_baths: float
├── apt_max_baths: float
├── apt_min_price: int
├── apt_max_price: int
├── apt_min_sqft: int
├── apt_max_sqft: int
└── Location Details:
├── latitude (float)
└── longitude (float)
```
## Supported Countries for Property Scraping
@@ -144,7 +145,7 @@ The following exceptions may be raised when using HomeHarvest:
- `InvalidSite` - valid options: `zillow`, `redfin`, `realtor.com`
- `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold`
- `NoResultsFound` - no properties found from your input
- `GeoCoordsNotFound` - if Zillow scraper is not able to create geo-coordinates from the location you input
- `GeoCoordsNotFound` - if Zillow scraper is not able to derive geo-coordinates from the location you input
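
One way a caller might handle the exceptions listed above is sketched here. The exception classes and `fake_scrape` are local stand-ins defined only for this example (no package import, no network access), so treat the control flow as an assumption about reasonable usage rather than documented behaviour.

```python
class NoResultsFound(Exception):
    """Local stand-in for the package exception of the same name."""


class InvalidListingType(Exception):
    """Local stand-in for the package exception of the same name."""


VALID_LISTING_TYPES = {"for_sale", "for_rent", "sold"}


def fake_scrape(location: str, listing_type: str) -> list:
    # Hypothetical placeholder for a scraping call; it performs no network I/O.
    if listing_type not in VALID_LISTING_TYPES:
        raise InvalidListingType(
            f"Provided listing type, '{listing_type}', does not exist."
        )
    if not location.strip():
        raise NoResultsFound("no properties found from your input")
    return [{"location": location, "listing_type": listing_type}]


def safe_scrape(location: str, listing_type: str) -> list:
    """Return results, or an empty list when the search yields nothing."""
    try:
        return fake_scrape(location, listing_type)
    except NoResultsFound:
        return []  # an empty search is often not fatal in a batch job


print(safe_scrape("Dallas, TX", "for_sale"))
print(safe_scrape("   ", "sold"))
```

`InvalidListingType` is deliberately left uncaught in `safe_scrape`: a bad option is a programming error, whereas an empty result is ordinary data.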
## Frequently Asked Questions

View File

@@ -23,9 +23,7 @@ def _validate_input(site_name: str, listing_type: str) -> None:
raise InvalidSite(f"Provided site, '{site_name}', does not exist.")
if listing_type.upper() not in ListingType.__members__:
raise InvalidListingType(
f"Provided listing type, '{listing_type}', does not exist."
)
raise InvalidListingType(f"Provided listing type, '{listing_type}', does not exist.")
def _get_ordered_properties(result: Property) -> list[str]:
@@ -35,38 +33,34 @@ def _get_ordered_properties(result: Property) -> list[str]:
"listing_type",
"property_type",
"status_text",
"currency",
"price",
"apt_min_price",
"apt_max_price",
"apt_min_sqft",
"apt_max_sqft",
"apt_min_beds",
"apt_max_beds",
"apt_min_baths",
"apt_max_baths",
"baths_min",
"baths_max",
"beds_min",
"beds_max",
"sqft_min",
"sqft_max",
"price_min",
"price_max",
"unit_count",
"tax_assessed_value",
"square_feet",
"price_per_sqft",
"beds",
"baths",
"lot_area_value",
"lot_area_unit",
"street_address",
"unit",
"address_one",
"address_two",
"city",
"state",
"zip_code",
"country",
"posted_time",
"bldg_min_beds",
"bldg_min_baths",
"bldg_min_area",
"bldg_unit_count",
"area_min",
"bldg_name",
"stories",
"year_built",
"agent_name",
"agent_phone",
"agent_email",
"days_on_market",
"sold_date",
"mls_id",
"img_src",
"latitude",
@@ -86,24 +80,33 @@ def _process_result(result: Property) -> pd.DataFrame:
prop_data["property_type"] = None
if "address" in prop_data:
address_data = prop_data["address"]
prop_data["street_address"] = address_data.street_address
prop_data["unit"] = address_data.unit
prop_data["address_one"] = address_data.address_one
prop_data["address_two"] = address_data.address_two
prop_data["city"] = address_data.city
prop_data["state"] = address_data.state
prop_data["zip_code"] = address_data.zip_code
prop_data["country"] = address_data.country
del prop_data["address"]
if "agent" in prop_data and prop_data["agent"] is not None:
agent_data = prop_data["agent"]
prop_data["agent_name"] = agent_data.name
prop_data["agent_phone"] = agent_data.phone
prop_data["agent_email"] = agent_data.email
del prop_data["agent"]
else:
prop_data["agent_name"] = None
prop_data["agent_phone"] = None
prop_data["agent_email"] = None
properties_df = pd.DataFrame([prop_data])
properties_df = properties_df[_get_ordered_properties(result)]
return properties_df
def _scrape_single_site(
location: str, site_name: str, listing_type: str, proxy: str = None
) -> pd.DataFrame:
def _scrape_single_site(location: str, site_name: str, listing_type: str, proxy: str = None) -> pd.DataFrame:
"""
Helper function to scrape a single site.
"""
@@ -120,9 +123,7 @@ def _scrape_single_site(
results = site.search()
properties_dfs = [_process_result(result) for result in results]
properties_dfs = [
df.dropna(axis=1, how="all") for df in properties_dfs if not df.empty
]
properties_dfs = [df.dropna(axis=1, how="all") for df in properties_dfs if not df.empty]
if not properties_dfs:
return pd.DataFrame()
@@ -134,6 +135,7 @@ def scrape_property(
site_name: Union[str, list[str]] = None,
listing_type: str = "for_sale",
proxy: str = None,
keep_duplicates: bool = False
) -> pd.DataFrame:
"""
Scrape property from various sites from a given location and listing type.
@@ -158,9 +160,7 @@ def scrape_property(
else:
with ThreadPoolExecutor() as executor:
futures = {
executor.submit(
_scrape_single_site, location, s_name, listing_type, proxy
): s_name
executor.submit(_scrape_single_site, location, s_name, listing_type, proxy): s_name
for s_name in site_name
}
@@ -175,14 +175,13 @@ def scrape_property(
final_df = pd.concat(results, ignore_index=True)
columns_to_track = ["street_address", "city", "unit"]
columns_to_track = ["address_one", "address_two", "city"]
#: validate they exist, otherwise create them
for col in columns_to_track:
if col not in final_df.columns:
final_df[col] = None
final_df = final_df.drop_duplicates(
subset=["street_address", "city", "unit"], keep="first"
)
if not keep_duplicates:
final_df = final_df.drop_duplicates(subset=columns_to_track, keep="first")
return final_df

View File

@@ -5,9 +5,7 @@ from homeharvest import scrape_property
def main():
parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
parser.add_argument(
"location", type=str, help="Location to scrape (e.g., San Francisco, CA)"
)
parser.add_argument("location", type=str, help="Location to scrape (e.g., San Francisco, CA)")
parser.add_argument(
"-s",
@@ -45,14 +43,17 @@ def main():
)
parser.add_argument(
"-p", "--proxy", type=str, default=None, help="Proxy to use for scraping"
"-k",
"--keep_duplicates",
action="store_true",
help="Keep duplicate properties based on address"
)
parser.add_argument("-p", "--proxy", type=str, default=None, help="Proxy to use for scraping")
args = parser.parse_args()
result = scrape_property(
args.location, args.site_name, args.listing_type, proxy=args.proxy
)
result = scrape_property(args.location, args.site_name, args.listing_type, proxy=args.proxy, keep_duplicates=args.keep_duplicates)
if not args.filename:
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

View File

@@ -19,10 +19,7 @@ class Scraper:
self.session = requests.Session()
if scraper_input.proxy:
proxy_url = scraper_input.proxy
proxies = {
"http": proxy_url,
"https": proxy_url
}
proxies = {"http": proxy_url, "https": proxy_url}
self.session.proxies.update(proxies)
self.listing_type = scraper_input.listing_type
self.site_name = scraper_input.site_name

View File

@@ -1,5 +1,7 @@
from dataclasses import dataclass
from enum import Enum
from typing import Tuple
from datetime import datetime
class SiteName(Enum):
@@ -56,12 +58,18 @@ class PropertyType(Enum):
@dataclass
class Address:
street_address: str
city: str
state: str
zip_code: str
unit: str | None = None
country: str | None = None
address_one: str | None = None
address_two: str | None = "#"
city: str | None = None
state: str | None = None
zip_code: str | None = None
@dataclass
class Agent:
name: str
phone: str | None = None
email: str | None = None
@dataclass
@@ -73,12 +81,7 @@ class Property:
property_type: PropertyType | None = None
# house for sale
price: int | None = None
tax_assessed_value: int | None = None
currency: str | None = None
square_feet: int | None = None
beds: int | None = None
baths: float | None = None
lot_area_value: float | None = None
lot_area_unit: str | None = None
stories: int | None = None
@@ -86,27 +89,32 @@ class Property:
price_per_sqft: int | None = None
mls_id: str | None = None
agent_name: str | None = None
agent: Agent | None = None
img_src: str | None = None
description: str | None = None
status_text: str | None = None
latitude: float | None = None
longitude: float | None = None
posted_time: str | None = None
posted_time: datetime | None = None
# building for sale
bldg_name: str | None = None
bldg_unit_count: int | None = None
bldg_min_beds: int | None = None
bldg_min_baths: float | None = None
bldg_min_area: int | None = None
area_min: int | None = None
# apt
apt_min_beds: int | None = None
apt_max_beds: int | None = None
apt_min_baths: float | None = None
apt_max_baths: float | None = None
apt_min_price: int | None = None
apt_max_price: int | None = None
apt_min_sqft: int | None = None
apt_max_sqft: int | None = None
beds_min: int | None = None
beds_max: int | None = None
baths_min: float | None = None
baths_max: float | None = None
sqft_min: int | None = None
sqft_max: int | None = None
price_min: int | None = None
price_max: int | None = None
unit_count: int | None = None
latitude: float | None = None
longitude: float | None = None
sold_date: datetime | None = None
days_on_market: int | None = None

View File

@@ -1,16 +1,23 @@
import json
"""
homeharvest.realtor.__init__
~~~~~~~~~~~~
This module implements the scraper for realtor.com
"""
from ..models import Property, Address
from .. import Scraper
from typing import Any, Generator
from ....exceptions import NoResultsFound
from ....utils import parse_address_two, parse_unit
from ....utils import parse_address_one, parse_address_two
from concurrent.futures import ThreadPoolExecutor, as_completed
class RealtorScraper(Scraper):
def __init__(self, scraper_input):
self.counter = 1
super().__init__(scraper_input)
self.search_url = "https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta"
self.search_url = (
"https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta"
)
def handle_location(self):
headers = {
@@ -50,6 +57,9 @@ class RealtorScraper(Scraper):
return result[0]
def handle_address(self, property_id: str) -> list[Property]:
"""
Handles a specific address & returns one property
"""
query = """query Property($property_id: ID!) {
property(id: $property_id) {
property_id
@@ -108,43 +118,45 @@ class RealtorScraper(Scraper):
response_json = response.json()
property_info = response_json["data"]["property"]
street_address, unit = parse_address_two(property_info["address"]["line"])
address_one, address_two = parse_address_one(property_info["address"]["line"])
return [
Property(
site_name=self.site_name,
address=Address(
street_address=street_address,
address_one=address_one,
address_two=address_two,
city=property_info["address"]["city"],
state=property_info["address"]["state_code"],
zip_code=property_info["address"]["postal_code"],
unit=unit,
country="USA",
),
property_url="https://www.realtor.com/realestateandhomes-detail/"
+ property_info["details"]["permalink"],
beds=property_info["basic"]["beds"],
baths=property_info["basic"]["baths"],
stories=property_info["details"]["stories"],
year_built=property_info["details"]["year_built"],
square_feet=property_info["basic"]["sqft"],
price_per_sqft=property_info["basic"]["price"]
// property_info["basic"]["sqft"]
if property_info["basic"]["sqft"] is not None
and property_info["basic"]["price"] is not None
price_per_sqft=property_info["basic"]["price"] // property_info["basic"]["sqft"]
if property_info["basic"]["sqft"] is not None and property_info["basic"]["price"] is not None
else None,
price=property_info["basic"]["price"],
mls_id=property_id,
listing_type=self.listing_type,
lot_area_value=property_info["public_record"]["lot_size"]
if property_info["public_record"] is not None
else None,
beds_min=property_info["basic"]["beds"],
beds_max=property_info["basic"]["beds"],
baths_min=property_info["basic"]["baths"],
baths_max=property_info["basic"]["baths"],
sqft_min=property_info["basic"]["sqft"],
sqft_max=property_info["basic"]["sqft"],
price_min=property_info["basic"]["price"],
price_max=property_info["basic"]["price"],
)
]
def handle_area(
self, variables: dict, return_total: bool = False
) -> list[Property] | int:
def handle_area(self, variables: dict, return_total: bool = False) -> list[Property] | int:
"""
Handles a location area & returns a list of properties
"""
query = (
"""query Home_search(
$city: String,
@@ -237,17 +249,15 @@ class RealtorScraper(Scraper):
return []
for result in response_json["data"]["home_search"]["results"]:
street_address, unit = parse_address_two(
result["location"]["address"]["line"]
)
self.counter += 1
address_one, _ = parse_address_one(result["location"]["address"]["line"])
realty_property = Property(
address=Address(
street_address=street_address,
address_one=address_one,
city=result["location"]["address"]["city"],
state=result["location"]["address"]["state_code"],
zip_code=result["location"]["address"]["postal_code"],
unit=parse_unit(result["location"]["address"]["unit"]),
country="USA",
address_two=parse_address_two(result["location"]["address"]["unit"]),
),
latitude=result["location"]["address"]["coordinate"]["lat"]
if result
@@ -264,20 +274,22 @@ class RealtorScraper(Scraper):
and "lon" in result["location"]["address"]["coordinate"]
else None,
site_name=self.site_name,
property_url="https://www.realtor.com/realestateandhomes-detail/"
+ result["property_id"],
beds=result["description"]["beds"],
baths=result["description"]["baths"],
property_url="https://www.realtor.com/realestateandhomes-detail/" + result["property_id"],
stories=result["description"]["stories"],
year_built=result["description"]["year_built"],
square_feet=result["description"]["sqft"],
price_per_sqft=result["price_per_sqft"],
price=result["list_price"],
mls_id=result["property_id"],
listing_type=self.listing_type,
lot_area_value=result["description"]["lot_sqft"],
beds_min=result["description"]["beds"],
beds_max=result["description"]["beds"],
baths_min=result["description"]["baths"],
baths_max=result["description"]["baths"],
sqft_min=result["description"]["sqft"],
sqft_max=result["description"]["sqft"],
price_min=result["list_price"],
price_max=result["list_price"],
)
properties.append(realty_property)
return properties

View File

@@ -1,9 +1,16 @@
"""
homeharvest.redfin.__init__
~~~~~~~~~~~~
This module implements the scraper for redfin.com
"""
import json
from typing import Any
from .. import Scraper
from ....utils import parse_address_two, parse_unit
from ..models import Property, Address, PropertyType, ListingType, SiteName
from ....exceptions import NoResultsFound
from ....utils import parse_address_two, parse_address_one
from ..models import Property, Address, PropertyType, ListingType, SiteName, Agent
from ....exceptions import NoResultsFound, SearchTooBroad
from datetime import datetime
class RedfinScraper(Scraper):
@@ -12,9 +19,7 @@ class RedfinScraper(Scraper):
self.listing_type = scraper_input.listing_type
def _handle_location(self):
url = "https://www.redfin.com/stingray/do/location-autocomplete?v=2&al=1&location={}".format(
self.location
)
url = "https://www.redfin.com/stingray/do/location-autocomplete?v=2&al=1&location={}".format(self.location)
response = self.session.get(url)
response_json = json.loads(response.text.replace("{}&&", ""))
@@ -26,11 +31,11 @@ class RedfinScraper(Scraper):
return "6" #: city
elif match_type == "1":
return "address" #: address, needs to be handled differently
elif match_type == "11":
return "state"
if "exactMatch" not in response_json["payload"]:
raise NoResultsFound(
"No results found for location: {}".format(self.location)
)
raise NoResultsFound("No results found for location: {}".format(self.location))
if response_json["payload"]["exactMatch"] is not None:
target = response_json["payload"]["exactMatch"]
@@ -45,67 +50,63 @@ class RedfinScraper(Scraper):
return home[key]["value"]
if not single_search:
street_address, unit = parse_address_two(get_value("streetLine"))
unit = parse_unit(get_value("streetLine"))
address = Address(
street_address=street_address,
city=home["city"],
state=home["state"],
zip_code=home["zip"],
unit=unit,
country="USA",
address_one=parse_address_one(get_value("streetLine"))[0],
address_two=parse_address_one(get_value("streetLine"))[1],
city=home.get("city"),
state=home.get("state"),
zip_code=home.get("zip"),
)
else:
address_info = home["streetAddress"]
street_address, unit = parse_address_two(address_info["assembledAddress"])
address_info = home.get("streetAddress")
address_one, address_two = parse_address_one(address_info.get("assembledAddress"))
address = Address(
street_address=street_address,
city=home["city"],
state=home["state"],
zip_code=home["zip"],
unit=unit,
country="USA",
address_one=address_one,
address_two=address_two,
city=home.get("city"),
state=home.get("state"),
zip_code=home.get("zip"),
)
url = "https://www.redfin.com{}".format(home["url"])
#: property_type = home["propertyType"] if "propertyType" in home else None
lot_size_data = home.get("lotSize")
if not isinstance(lot_size_data, int):
lot_size = (
lot_size_data.get("value", None)
if isinstance(lot_size_data, dict)
else None
)
lot_size = lot_size_data.get("value", None) if isinstance(lot_size_data, dict) else None
else:
lot_size = lot_size_data
lat_long = get_value("latLong")
return Property(
site_name=self.site_name,
listing_type=self.listing_type,
address=address,
property_url=url,
beds=home["beds"] if "beds" in home else None,
baths=home["baths"] if "baths" in home else None,
beds_min=home["beds"] if "beds" in home else None,
beds_max=home["beds"] if "beds" in home else None,
baths_min=home["baths"] if "baths" in home else None,
baths_max=home["baths"] if "baths" in home else None,
price_min=get_value("price"),
price_max=get_value("price"),
sqft_min=get_value("sqFt"),
sqft_max=get_value("sqFt"),
stories=home["stories"] if "stories" in home else None,
agent_name=get_value("listingAgent"),
agent=Agent( #: listingAgent, some have sellingAgent as well
name=home['listingAgent'].get('name') if 'listingAgent' in home else None,
phone=home['listingAgent'].get('phone') if 'listingAgent' in home else None,
),
description=home["listingRemarks"] if "listingRemarks" in home else None,
year_built=get_value("yearBuilt")
if not single_search
else home["yearBuilt"],
square_feet=get_value("sqFt"),
year_built=get_value("yearBuilt") if not single_search else home.get("yearBuilt"),
lot_area_value=lot_size,
property_type=PropertyType.from_int_code(home.get("propertyType")),
price_per_sqft=get_value("pricePerSqFt"),
price=get_value("price"),
price_per_sqft=get_value("pricePerSqFt") if type(home.get("pricePerSqFt")) != int else home.get("pricePerSqFt"),
mls_id=get_value("mlsId"),
latitude=home["latLong"]["latitude"]
if "latLong" in home and "latitude" in home["latLong"]
else None,
longitude=home["latLong"]["longitude"]
if "latLong" in home and "longitude" in home["latLong"]
else None,
latitude=lat_long.get('latitude') if lat_long else None,
longitude=lat_long.get('longitude') if lat_long else None,
sold_date=datetime.fromtimestamp(home['soldDate'] / 1000) if 'soldDate' in home else None,
days_on_market=get_value("dom")
)
def _handle_rentals(self, region_id, region_type):
@@ -125,12 +126,10 @@ class RedfinScraper(Scraper):
address_info = home_data.get("addressInfo", {})
centroid = address_info.get("centroid", {}).get("centroid", {})
address = Address(
street_address=address_info.get("formattedStreetLine", None),
city=address_info.get("city", None),
state=address_info.get("state", None),
zip_code=address_info.get("zip", None),
unit=None,
country="US" if address_info.get("countryCode", None) == 1 else None,
address_one=parse_address_one(address_info.get("formattedStreetLine"))[0],
city=address_info.get("city"),
state=address_info.get("state"),
zip_code=address_info.get("zip"),
)
price_range = rental_data.get("rentPriceRange", {"min": None, "max": None})
@@ -143,20 +142,20 @@ class RedfinScraper(Scraper):
site_name=SiteName.REDFIN,
listing_type=ListingType.FOR_RENT,
address=address,
apt_min_beds=bed_range.get("min", None),
apt_min_baths=bath_range.get("min", None),
apt_max_beds=bed_range.get("max", None),
apt_max_baths=bath_range.get("max", None),
description=rental_data.get("description", None),
latitude=centroid.get("latitude", None),
longitude=centroid.get("longitude", None),
apt_min_price=price_range.get("min", None),
apt_max_price=price_range.get("max", None),
apt_min_sqft=sqft_range.get("min", None),
apt_max_sqft=sqft_range.get("max", None),
img_src=home_data.get("staticMapUrl", None),
posted_time=rental_data.get("lastUpdated", None),
bldg_name=rental_data.get("propertyName", None),
description=rental_data.get("description"),
latitude=centroid.get("latitude"),
longitude=centroid.get("longitude"),
baths_min=bath_range.get("min"),
baths_max=bath_range.get("max"),
beds_min=bed_range.get("min"),
beds_max=bed_range.get("max"),
price_min=price_range.get("min"),
price_max=price_range.get("max"),
sqft_min=sqft_range.get("min"),
sqft_max=sqft_range.get("max"),
img_src=home_data.get("staticMapUrl"),
posted_time=rental_data.get("lastUpdated"),
bldg_name=rental_data.get("propertyName"),
)
properties_list.append(property_)
@@ -175,16 +174,15 @@ class RedfinScraper(Scraper):
building["address"]["streetType"],
]
)
street_address, unit = parse_address_two(street_address)
return Property(
site_name=self.site_name,
property_type=PropertyType("BUILDING"),
address=Address(
street_address=street_address,
address_one=parse_address_one(street_address)[0],
city=building["address"]["city"],
state=building["address"]["stateOrProvinceCode"],
zip_code=building["address"]["postalCode"],
unit=parse_unit(
address_two=parse_address_two(
" ".join(
[
building["address"]["unitType"],
@@ -195,7 +193,7 @@ class RedfinScraper(Scraper):
),
property_url="https://www.redfin.com{}".format(building["url"]),
listing_type=self.listing_type,
bldg_unit_count=building["numUnitsForSale"],
unit_count=building.get("numUnitsForSale"),
)
def handle_address(self, home_id: str):
@@ -206,7 +204,6 @@ class RedfinScraper(Scraper):
https://www.redfin.com/stingray/api/home/details/aboveTheFold?propertyId=147337694&accessLevel=3
https://www.redfin.com/stingray/api/home/details/belowTheFold?propertyId=147337694&accessLevel=3
"""
url = "https://www.redfin.com/stingray/api/home/details/aboveTheFold?propertyId={}&accessLevel=3".format(
home_id
)
@@ -214,14 +211,15 @@ class RedfinScraper(Scraper):
response = self.session.get(url)
response_json = json.loads(response.text.replace("{}&&", ""))
parsed_home = self._parse_home(
response_json["payload"]["addressSectionInfo"], single_search=True
)
parsed_home = self._parse_home(response_json["payload"]["addressSectionInfo"], single_search=True)
return [parsed_home]
def search(self):
region_id, region_type = self._handle_location()
if region_type == "state":
raise SearchTooBroad("State searches are not supported, please use a more specific location.")
if region_type == "address":
home_id = region_id
return self.handle_address(home_id)
@@ -235,10 +233,14 @@ class RedfinScraper(Scraper):
url = f"https://www.redfin.com/stingray/api/gis?al=1&region_id={region_id}&region_type={region_type}&sold_within_days=30&num_homes=100000"
response = self.session.get(url)
response_json = json.loads(response.text.replace("{}&&", ""))
homes = [
self._parse_home(home) for home in response_json["payload"]["homes"]
] + [
self._parse_building(building)
for building in response_json["payload"]["buildings"].values()
]
return homes
if "payload" in response_json:
homes_list = response_json["payload"].get("homes", [])
buildings_list = response_json["payload"].get("buildings", {}).values()
homes = [self._parse_home(home) for home in homes_list] + [
self._parse_building(building) for building in buildings_list
]
return homes
else:
return []

View File

@@ -1,40 +1,73 @@
"""
homeharvest.zillow.__init__
~~~~~~~~~~~~
This module implements the scraper for zillow.com
"""
import re
import json
import tls_client
from .. import Scraper
from ....utils import parse_address_two, parse_unit
from requests.exceptions import HTTPError
from ....utils import parse_address_one, parse_address_two
from ....exceptions import GeoCoordsNotFound, NoResultsFound
from ..models import Property, Address, ListingType, PropertyType
from ..models import Property, Address, ListingType, PropertyType, Agent
import urllib.parse
from datetime import datetime, timedelta
class ZillowScraper(Scraper):
def __init__(self, scraper_input):
super().__init__(scraper_input)
self.session = tls_client.Session(
client_identifier="chrome112", random_tls_extension_order=True
)
self.session.headers.update({
'authority': 'www.zillow.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'en-US,en;q=0.9',
'cache-control': 'max-age=0',
'sec-ch-ua': '"Chromium";v="117", "Not)A;Brand";v="24", "Google Chrome";v="117"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'same-origin',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
})
if not self.is_plausible_location(self.location):
raise NoResultsFound("Invalid location input: {}".format(self.location))
if self.listing_type == ListingType.FOR_SALE:
self.url = f"https://www.zillow.com/homes/for_sale/{self.location}_rb/"
elif self.listing_type == ListingType.FOR_RENT:
self.url = f"https://www.zillow.com/homes/for_rent/{self.location}_rb/"
else:
self.url = f"https://www.zillow.com/homes/recently_sold/{self.location}_rb/"
listing_type_to_url_path = {
ListingType.FOR_SALE: "for_sale",
ListingType.FOR_RENT: "for_rent",
ListingType.SOLD: "recently_sold",
}
self.url = f"https://www.zillow.com/homes/{listing_type_to_url_path[self.listing_type]}/{self.location}_rb/"
def is_plausible_location(self, location: str) -> bool:
url = (
"https://www.zillowstatic.com/autocomplete/v3/suggestions?q={"
"}&abKey=6666272a-4b99-474c-b857-110ec438732b&clientId=homepage-render"
).format(location)
).format(urllib.parse.quote(location))
response = self.session.get(url)
resp = self.session.get(url)
return response.json()["results"] != []
return resp.json()["results"] != []
def search(self):
resp = self.session.get(
self.url, headers=self._get_headers()
)
resp.raise_for_status()
resp = self.session.get(self.url)
if resp.status_code != 200:
raise HTTPError(
f"bad response status code: {resp.status_code}"
)
content = resp.text
match = re.search(
@@ -43,9 +76,7 @@ class ZillowScraper(Scraper):
re.DOTALL,
)
if not match:
raise NoResultsFound(
"No results were found for Zillow with the given Location."
)
raise NoResultsFound("No results were found for Zillow with the given Location.")
json_str = match.group(1)
data = json.loads(json_str)
@@ -130,13 +161,23 @@ class ZillowScraper(Scraper):
"wants": {"cat1": ["mapResults"]},
"isDebugRequest": False,
}
resp = self.session.put(
url, headers=self._get_headers(), json=payload
)
resp.raise_for_status()
a = resp.json()
resp = self.session.put(url, json=payload)
if resp.status_code != 200:
raise HTTPError(
f"bad response status code: {resp.status_code}"
)
return self._parse_properties(resp.json())
@staticmethod
def parse_posted_time(time: str) -> datetime:
int_time = int(time.split(" ")[0])
if "hour" in time:
return datetime.now() - timedelta(hours=int_time)
if "day" in time:
return datetime.now() - timedelta(days=int_time)
def _parse_properties(self, property_data: dict):
mapresults = property_data["cat1"]["searchResults"]["mapResults"]
@@ -146,87 +187,70 @@ class ZillowScraper(Scraper):
if "hdpData" in result:
home_info = result["hdpData"]["homeInfo"]
address_data = {
"street_address": parse_address_two(home_info["streetAddress"])[0],
"unit": parse_unit(home_info["unit"])
if "unit" in home_info
else None,
"city": home_info["city"],
"state": home_info["state"],
"zip_code": home_info["zipcode"],
"country": home_info["country"],
"address_one": parse_address_one(home_info.get("streetAddress"))[0],
"address_two": parse_address_two(home_info["unit"]) if "unit" in home_info else "#",
"city": home_info.get("city"),
"state": home_info.get("state"),
"zip_code": home_info.get("zipcode"),
}
property_data = {
"site_name": self.site_name,
"address": Address(**address_data),
"property_url": f"https://www.zillow.com{result['detailUrl']}",
"beds": int(home_info["bedrooms"])
if "bedrooms" in home_info
else None,
"baths": home_info.get("bathrooms"),
"square_feet": int(home_info["livingArea"])
if "livingArea" in home_info
else None,
"currency": home_info["currency"],
"price": home_info.get("price"),
"tax_assessed_value": int(home_info["taxAssessedValue"])
if "taxAssessedValue" in home_info
else None,
"property_type": PropertyType(home_info["homeType"]),
"listing_type": ListingType(
home_info["statusType"]
if "statusType" in home_info
else self.listing_type
property_obj = Property(
site_name=self.site_name,
address=Address(**address_data),
property_url=f"https://www.zillow.com{result['detailUrl']}",
tax_assessed_value=int(home_info["taxAssessedValue"]) if "taxAssessedValue" in home_info else None,
property_type=PropertyType(home_info.get("homeType")),
listing_type=ListingType(
home_info["statusType"] if "statusType" in home_info else self.listing_type
),
"lot_area_value": round(home_info["lotAreaValue"], 2)
if "lotAreaValue" in home_info
else None,
"lot_area_unit": home_info.get("lotAreaUnit"),
"latitude": result["latLong"]["latitude"],
"longitude": result["latLong"]["longitude"],
"status_text": result.get("statusText"),
"posted_time": result["variableData"]["text"]
status_text=result.get("statusText"),
posted_time=self.parse_posted_time(result["variableData"]["text"])
if "variableData" in result
and "text" in result["variableData"]
and result["variableData"]["type"] == "TIME_ON_INFO"
and "text" in result["variableData"]
and result["variableData"]["type"] == "TIME_ON_INFO"
else None,
"img_src": result.get("imgSrc"),
"price_per_sqft": int(home_info["price"] // home_info["livingArea"])
if "livingArea" in home_info
and home_info["livingArea"] != 0
and "price" in home_info
price_min=home_info.get("price"),
price_max=home_info.get("price"),
beds_min=int(home_info["bedrooms"]) if "bedrooms" in home_info else None,
beds_max=int(home_info["bedrooms"]) if "bedrooms" in home_info else None,
baths_min=home_info.get("bathrooms"),
baths_max=home_info.get("bathrooms"),
sqft_min=int(home_info["livingArea"]) if "livingArea" in home_info else None,
sqft_max=int(home_info["livingArea"]) if "livingArea" in home_info else None,
price_per_sqft=int(home_info["price"] // home_info["livingArea"])
if "livingArea" in home_info and home_info["livingArea"] != 0 and "price" in home_info
else None,
}
property_obj = Property(**property_data)
latitude=result["latLong"]["latitude"],
longitude=result["latLong"]["longitude"],
lot_area_value=round(home_info["lotAreaValue"], 2) if "lotAreaValue" in home_info else None,
lot_area_unit=home_info.get("lotAreaUnit"),
img_src=result.get("imgSrc"),
)
properties_list.append(property_obj)
elif "isBuilding" in result:
price = result["price"]
building_data = {
"property_url": f"https://www.zillow.com{result['detailUrl']}",
"site_name": self.site_name,
"property_type": PropertyType("BUILDING"),
"listing_type": ListingType(result["statusType"]),
"img_src": result["imgSrc"],
"price": int(price.replace("From $", "").replace(",", ""))
if "From $" in price
else None,
"apt_min_price": int(
price.replace("$", "").replace(",", "").replace("+/mo", "")
)
if "+/mo" in price
else None,
"address": self._extract_address(result["address"]),
"bldg_min_beds": result["minBeds"],
"currency": "USD",
"bldg_min_baths": result["minBaths"],
"bldg_min_area": result.get("minArea"),
"bldg_unit_count": result["unitCount"],
"bldg_name": result.get("communityName"),
"status_text": result["statusText"],
"latitude": result["latLong"]["latitude"],
"longitude": result["latLong"]["longitude"],
}
building_obj = Property(**building_data)
price_string = result["price"].replace("$", "").replace(",", "").replace("+/mo", "")
match = re.search(r"(\d+)", price_string)
price_value = int(match.group(1)) if match else None
building_obj = Property(
property_url=f"https://www.zillow.com{result['detailUrl']}",
site_name=self.site_name,
property_type=PropertyType("BUILDING"),
listing_type=ListingType(result["statusType"]),
img_src=result.get("imgSrc"),
address=self._extract_address(result["address"]),
baths_min=result.get("minBaths"),
area_min=result.get("minArea"),
bldg_name=result.get("communityName"),
status_text=result.get("statusText"),
price_min=price_value if "+/mo" in result.get("price") else None,
price_max=price_value if "+/mo" in result.get("price") else None,
latitude=result.get("latLong", {}).get("latitude"),
longitude=result.get("latLong", {}).get("longitude"),
unit_count=result.get("unitCount"),
)
properties_list.append(building_obj)
return properties_list
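Review note on the building branch in the hunk above: the new code strips "$", "," and "+/mo" from the price text, takes the first run of digits, and keeps the value only when the original text contained "+/mo". The old "From $" branch is gone, so such prices now yield None. A minimal sketch of that logic as a standalone helper, assuming price strings shaped like "$1,250+/mo"; the name `parse_building_price` is hypothetical and does not appear in the diff.

```python
import re
from typing import Optional


def parse_building_price(price_text: Optional[str]) -> Optional[int]:
    """Hypothetical helper: monthly price from text such as '$1,250+/mo'."""
    # As in the diff, only the '+/mo' form is treated as a usable price.
    if not price_text or "+/mo" not in price_text:
        return None
    cleaned = price_text.replace("$", "").replace(",", "").replace("+/mo", "")
    # First run of digits, mirroring re.search(r"(\d+)", ...) in the diff.
    match = re.search(r"(\d+)", cleaned)
    return int(match.group(1)) if match else None
```

Guarding against a missing price up front avoids evaluating `"+/mo" in None`, which would raise a TypeError.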
@@ -241,43 +265,43 @@ class ZillowScraper(Scraper):
else property_data["hdpUrl"]
)
address_data = property_data["address"]
street_address, unit = parse_address_two(address_data["streetAddress"])
address_one, address_two = parse_address_one(address_data["streetAddress"])
address = Address(
street_address=street_address,
unit=unit,
address_one=address_one,
address_two=address_two if address_two else "#",
city=address_data["city"],
state=address_data["state"],
zip_code=address_data["zipcode"],
country=property_data.get("country"),
)
property_type = property_data.get("homeType", None)
return Property(
site_name=self.site_name,
address=address,
property_url=url,
beds=property_data.get("bedrooms", None),
baths=property_data.get("bathrooms", None),
year_built=property_data.get("yearBuilt", None),
price=property_data.get("price", None),
tax_assessed_value=property_data.get("taxAssessedValue", None),
property_type=PropertyType(property_type) if property_type in PropertyType.__members__ else None,
listing_type=self.listing_type,
address=address,
year_built=property_data.get("yearBuilt"),
tax_assessed_value=property_data.get("taxAssessedValue"),
lot_area_value=property_data.get("lotAreaValue"),
lot_area_unit=property_data["lotAreaUnits"].lower() if "lotAreaUnits" in property_data else None,
agent=Agent(
name=property_data.get("attributionInfo", {}).get("agentName")
),
stories=property_data.get("resoFacts", {}).get("stories"),
mls_id=property_data.get("attributionInfo", {}).get("mlsId"),
beds_min=property_data.get("bedrooms"),
beds_max=property_data.get("bedrooms"),
baths_min=property_data.get("bathrooms"),
baths_max=property_data.get("bathrooms"),
price_min=property_data.get("price"),
price_max=property_data.get("price"),
sqft_min=property_data.get("livingArea"),
sqft_max=property_data.get("livingArea"),
price_per_sqft=property_data.get("resoFacts", {}).get("pricePerSquareFoot"),
latitude=property_data.get("latitude"),
longitude=property_data.get("longitude"),
img_src=property_data.get("streetViewTileImageUrlMediumAddress"),
currency=property_data.get("currency", None),
lot_area_value=property_data.get("lotAreaValue"),
lot_area_unit=property_data["lotAreaUnits"].lower()
if "lotAreaUnits" in property_data
else None,
agent_name=property_data.get("attributionInfo", {}).get("agentName", None),
stories=property_data.get("resoFacts", {}).get("stories", None),
description=property_data.get("description", None),
mls_id=property_data.get("attributionInfo", {}).get("mlsId", None),
price_per_sqft=property_data.get("resoFacts", {}).get(
"pricePerSquareFoot", None
),
square_feet=property_data.get("livingArea", None),
property_type=PropertyType(property_type),
listing_type=self.listing_type,
description=property_data.get("description"),
)
def _extract_address(self, address_str):
@@ -290,7 +314,7 @@ class ZillowScraper(Scraper):
if len(parts) != 3:
raise ValueError(f"Unexpected address format: {address_str}")
street_address = parts[0].strip()
address_one = parts[0].strip()
city = parts[1].strip()
state_zip = parts[2].split(" ")
@@ -303,31 +327,11 @@ class ZillowScraper(Scraper):
else:
raise ValueError(f"Unexpected state/zip format in address: {address_str}")
street_address, unit = parse_address_two(street_address)
address_one, address_two = parse_address_one(address_one)
return Address(
street_address=street_address,
address_one=address_one,
address_two=address_two if address_two else "#",
city=city,
unit=unit,
state=state,
zip_code=zip_code,
country="USA",
)
@staticmethod
def _get_headers():
return {
"authority": "www.zillow.com",
"accept": "*/*",
"accept-language": "en-US,en;q=0.9",
"content-type": "application/json",
"cookie": 'zjs_user_id=null; zg_anonymous_id=%220976ab81-2950-4013-98f0-108b15a554d2%22; zguid=24|%246b1bc625-3955-4d1e-a723-e59602e4ed08; g_state={"i_p":1693611172520,"i_l":1}; zgsession=1|d48820e2-1659-4d2f-b7d2-99a8127dd4f3; zjs_anonymous_id=%226b1bc625-3955-4d1e-a723-e59602e4ed08%22; JSESSIONID=82E8274D3DC8AF3AB9C8E613B38CF861; search=6|1697585860120%7Crb%3DDallas%252C-TX%26rect%3D33.016646%252C-96.555516%252C32.618763%252C-96.999347%26disp%3Dmap%26mdm%3Dauto%26sort%3Ddays%26listPriceActive%3D1%26fs%3D1%26fr%3D0%26mmm%3D0%26rs%3D0%26ah%3D0%26singlestory%3D0%26abo%3D0%26garage%3D0%26pool%3D0%26ac%3D0%26waterfront%3D0%26finished%3D0%26unfinished%3D0%26cityview%3D0%26mountainview%3D0%26parkview%3D0%26waterview%3D0%26hoadata%3D1%263dhome%3D0%26commuteMode%3Ddriving%26commuteTimeOfDay%3Dnow%09%0938128%09%7B%22isList%22%3Atrue%2C%22isMap%22%3Atrue%7D%09%09%09%09%09; AWSALB=gAlFj5Ngnd4bWP8k7CME/+YlTtX9bHK4yEkdPHa3VhL6K523oGyysFxBEpE1HNuuyL+GaRPvt2i/CSseAb+zEPpO4SNjnbLAJzJOOO01ipnWN3ZgPaa5qdv+fAki; AWSALBCORS=gAlFj5Ngnd4bWP8k7CME/+YlTtX9bHK4yEkdPHa3VhL6K523oGyysFxBEpE1HNuuyL+GaRPvt2i/CSseAb+zEPpO4SNjnbLAJzJOOO01ipnWN3ZgPaa5qdv+fAki; search=6|1697587741808%7Crect%3D33.37188814545521%2C-96.34484483007813%2C32.260490641365685%2C-97.21001816992188%26disp%3Dmap%26mdm%3Dauto%26p%3D1%26sort%3Ddays%26z%3D1%26listPriceActive%3D1%26fs%3D1%26fr%3D0%26mmm%3D0%26rs%3D0%26ah%3D0%26singlestory%3D0%26housing-connector%3D0%26abo%3D0%26garage%3D0%26pool%3D0%26ac%3D0%26waterfront%3D0%26finished%3D0%26unfinished%3D0%26cityview%3D0%26mountainview%3D0%26parkview%3D0%26waterview%3D0%26hoadata%3D1%26zillow-owned%3D0%263dhome%3D0%26featuredMultiFamilyBuilding%3D0%26commuteMode%3Ddriving%26commuteTimeOfDay%3Dnow%09%09%09%7B%22isList%22%3Atrue%2C%22isMap%22%3Atrue%7D%09%09%09%09%09',
"origin": "https://www.zillow.com",
"referer": "https://www.zillow.com",
"sec-ch-ua": '"Chromium";v="116", "Not)A;Brand";v="24", "Google Chrome";v="116"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
}
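Review note on `parse_posted_time` in the Zillow scraper diff: as written, it returns None implicitly when the text mentions neither "hour" nor "day", and `int(...)` raises ValueError when the first token is not a number. A minimal sketch of a more defensive variant follows. The example strings and the optional `now` parameter are assumptions for illustration, not part of the diff.

```python
from datetime import datetime, timedelta
from typing import Optional


def parse_posted_time(text: str, now: Optional[datetime] = None) -> Optional[datetime]:
    """Convert text such as '3 days on Zillow' into an approximate datetime."""
    # `now` is injectable (an assumption) so the result is deterministic in tests.
    now = now or datetime.now()
    first_token = text.split(" ")[0]
    # Non-numeric leading tokens are treated as "unknown" instead of raising.
    if not first_token.isdigit():
        return None
    amount = int(first_token)
    if "hour" in text:
        return now - timedelta(hours=amount)
    if "day" in text:
        return now - timedelta(days=amount)
    # Units other than hours and days are not handled, as in the diff.
    return None
```

Returning Optional[datetime] makes the fall-through case explicit for callers that store the value in `posted_time`.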


@@ -12,3 +12,7 @@ class NoResultsFound(Exception):
class GeoCoordsNotFound(Exception):
"""Raised when no property is found for the given address"""
class SearchTooBroad(Exception):
"""Raised when the search is too broad"""


@@ -1,9 +1,9 @@
import re
def parse_address_two(street_address: str) -> tuple:
def parse_address_one(street_address: str) -> tuple:
if not street_address:
return street_address, None
return street_address, "#"
apt_match = re.search(
r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+|SUITE\s*[\dA-Z]+)$",
@@ -13,36 +13,26 @@ def parse_address_two(street_address: str) -> tuple:
if apt_match:
apt_str = apt_match.group().strip()
cleaned_apt_str = re.sub(
r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", "#", apt_str, flags=re.I
)
cleaned_apt_str = re.sub(r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", "#", apt_str, flags=re.I)
main_address = street_address.replace(apt_str, "").strip()
return main_address, cleaned_apt_str
else:
return street_address, None
return street_address, "#"
def parse_unit(street_address: str):
def parse_address_two(street_address: str):
if not street_address:
return None
return "#"
apt_match = re.search(
r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+)$",
r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+|SUITE\s*[\dA-Z]+)$",
street_address,
re.I,
)
if apt_match:
apt_str = apt_match.group().strip()
apt_str = re.sub(r"(APT\s*|UNIT\s*|LOT\s*)", "#", apt_str, flags=re.I)
apt_str = re.sub(r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", "#", apt_str, flags=re.I)
return apt_str
else:
return None
if __name__ == "__main__":
print(parse_address_two("4303 E Cactus Rd Apt 126"))
print(parse_address_two("1234 Elm Street apt 2B"))
print(parse_address_two("1234 Elm Street UNIT 3A"))
print(parse_address_two("1234 Elm Street unit 3A"))
print(parse_address_two("1234 Elm Street SuIte 3A"))
return "#"
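Review note on the renamed helpers above: `parse_address_one` splits a street address into the main line and a normalized unit, and `parse_address_two` normalizes a unit string alone. Both now return "#" instead of None when no unit is found. The following is a self-contained sketch reassembled from the visible hunk lines. The `re.search` arguments hidden between hunks in `parse_address_one` are assumed to match those shown for `parse_address_two`, and the shared `UNIT_PATTERN` and `UNIT_PREFIX` constants are introduced here for brevity only.

```python
import re

# Trailing unit designators recognized by the diff's regex (case-insensitive).
UNIT_PATTERN = r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+|SUITE\s*[\dA-Z]+)$"
UNIT_PREFIX = r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)"


def parse_address_one(street_address: str) -> tuple:
    """Return (main address, unit); unit is '#' when none is present."""
    if not street_address:
        return street_address, "#"
    apt_match = re.search(UNIT_PATTERN, street_address, re.I)
    if apt_match:
        apt_str = apt_match.group().strip()
        # Normalize 'Apt 126', 'UNIT 3A', ... to '#126', '#3A', ...
        cleaned_apt_str = re.sub(UNIT_PREFIX, "#", apt_str, flags=re.I)
        main_address = street_address.replace(apt_str, "").strip()
        return main_address, cleaned_apt_str
    return street_address, "#"


def parse_address_two(street_address: str) -> str:
    """Return the normalized unit only, or '#' when none is present."""
    if not street_address:
        return "#"
    apt_match = re.search(UNIT_PATTERN, street_address, re.I)
    if apt_match:
        return re.sub(UNIT_PREFIX, "#", apt_match.group().strip(), flags=re.I)
    return "#"
```

The cases in tests/test_utils.py from this diff, such as "4303 E Cactus Rd Apt 126" mapping to ("4303 E Cactus Rd", "#126"), are the ones this sketch is checked against.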

poetry.lock (generated): 13 lines changed

@@ -408,6 +408,17 @@ files = [
{file = "six-1.16.0.tar.gz", hash = "sha256:1e61c37477a1626458e36f7b1d82aa5c9b094fa4802892072e49de9c60c4c926"},
]
[[package]]
name = "tls-client"
version = "0.2.2"
description = "Advanced Python HTTP Client."
optional = false
python-versions = "*"
files = [
{file = "tls_client-0.2.2-py3-none-any.whl", hash = "sha256:30934871397cdad6862e00b5634f382666314a452ddd3d774e18323a0ad9b765"},
{file = "tls_client-0.2.2.tar.gz", hash = "sha256:78bc0e291e3aadc6c5e903b62bb26c01374577691f2a9e5e17899900a5927a13"},
]
[[package]]
name = "tomli"
version = "2.0.1"
@@ -450,4 +461,4 @@ zstd = ["zstandard (>=0.18.0)"]
[metadata]
lock-version = "2.0"
python-versions = "^3.10"
content-hash = "3647d568f5623dd762f19029230626a62e68309fa2ef8be49a36382c19264a5f"
content-hash = "9b77e1a09fcf2cf5e7e6be53f304cd21a6a51ea51680d661a178afe5e5343670"


@@ -1,6 +1,6 @@
[tool.poetry]
name = "homeharvest"
version = "0.2.5"
version = "0.2.17"
description = "Real estate scraping library supporting Zillow, Realtor.com & Redfin."
authors = ["Zachary Hampton <zachary@zacharysproducts.com>", "Cullen Watson <cullen@cullen.ai>"]
homepage = "https://github.com/ZacharyHampton/HomeHarvest"
@@ -14,6 +14,7 @@ python = "^3.10"
requests = "^2.31.0"
pandas = "^2.1.0"
openpyxl = "^3.1.2"
tls-client = "^0.2.2"
[tool.poetry.group.dev.dependencies]
@@ -21,4 +22,4 @@ pytest = "^7.4.2"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
build-backend = "poetry.core.masonry.api"


@@ -4,20 +4,16 @@ from homeharvest.exceptions import (
InvalidListingType,
NoResultsFound,
GeoCoordsNotFound,
SearchTooBroad,
)
def test_redfin():
results = [
scrape_property(
location="2530 Al Lipscomb Way", site_name="redfin", listing_type="for_sale"
),
scrape_property(
location="Phoenix, AZ, USA", site_name=["redfin"], listing_type="for_rent"
),
scrape_property(
location="Dallas, TX, USA", site_name="redfin", listing_type="sold"
),
scrape_property(location="San Diego", site_name="redfin", listing_type="for_sale"),
scrape_property(location="2530 Al Lipscomb Way", site_name="redfin", listing_type="for_sale"),
scrape_property(location="Phoenix, AZ, USA", site_name=["redfin"], listing_type="for_rent"),
scrape_property(location="Dallas, TX, USA", site_name="redfin", listing_type="sold"),
scrape_property(location="85281", site_name="redfin"),
]
@@ -30,9 +26,10 @@ def test_redfin():
location="abceefg ju098ot498hh9",
site_name="redfin",
listing_type="for_sale",
)
),
scrape_property(location="Florida", site_name="redfin", listing_type="for_rent"),
]
except (InvalidSite, InvalidListingType, NoResultsFound, GeoCoordsNotFound):
except (InvalidSite, InvalidListingType, NoResultsFound, GeoCoordsNotFound, SearchTooBroad):
assert True
assert all([result is None for result in bad_results])

tests/test_utils.py (normal file): 24 lines added

@@ -0,0 +1,24 @@
from homeharvest.utils import parse_address_one, parse_address_two
def test_parse_address_one():
test_data = [
("4303 E Cactus Rd Apt 126", ("4303 E Cactus Rd", "#126")),
("1234 Elm Street apt 2B", ("1234 Elm Street", "#2B")),
("1234 Elm Street UNIT 3A", ("1234 Elm Street", "#3A")),
("1234 Elm Street unit 3A", ("1234 Elm Street", "#3A")),
("1234 Elm Street SuIte 3A", ("1234 Elm Street", "#3A")),
]
for input_data, (exp_addr_one, exp_addr_two) in test_data:
address_one, address_two = parse_address_one(input_data)
assert address_one == exp_addr_one
assert address_two == exp_addr_two
def test_parse_address_two():
test_data = [("Apt 126", "#126"), ("apt 2B", "#2B"), ("UNIT 3A", "#3A"), ("unit 3A", "#3A"), ("SuIte 3A", "#3A")]
for input_data, expected in test_data:
output = parse_address_two(input_data)
assert output == expected


@@ -9,16 +9,12 @@ from homeharvest.exceptions import (
def test_zillow():
results = [
scrape_property(
location="2530 Al Lipscomb Way", site_name="zillow", listing_type="for_sale"
),
scrape_property(
location="Phoenix, AZ, USA", site_name=["zillow"], listing_type="for_rent"
),
scrape_property(
location="Dallas, TX, USA", site_name="zillow", listing_type="sold"
),
scrape_property(location="2530 Al Lipscomb Way", site_name="zillow", listing_type="for_sale"),
scrape_property(location="Phoenix, AZ, USA", site_name=["zillow"], listing_type="for_rent"),
scrape_property(location="Surprise, AZ", site_name=["zillow"], listing_type="for_sale"),
scrape_property(location="Dallas, TX, USA", site_name="zillow", listing_type="sold"),
scrape_property(location="85281", site_name="zillow"),
scrape_property(location="3268 88th st s, Lakewood", site_name="zillow", listing_type="for_rent"),
]
assert all([result is not None for result in results])