[chore]: clean up

pull/31/head
Cullen Watson 2023-10-04 08:58:55 -05:00
parent f8c0dd766d
commit 51bde20c3c
8 changed files with 277 additions and 348 deletions

README.md

@@ -1,6 +1,6 @@
 <img src="https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/d1a2bf8b-09f5-4c57-b33a-0ada8a34f12d" width="400">
-**HomeHarvest** is a simple, yet comprehensive, real estate scraping library.
+**HomeHarvest** is a simple, yet comprehensive, real estate scraping library that extracts and formats data in the style of MLS listings.
 [![Try with Replit](https://replit.com/badge?caption=Try%20with%20Replit)](https://replit.com/@ZacharyHampton/HomeHarvestDemo)
@@ -11,10 +11,14 @@
 Check out another project we wrote: ***[JobSpy](https://github.com/cullenwatson/JobSpy)** a Python package for job scraping*
-## Features
-- Scrapes properties from **Zillow**, **Realtor.com** & **Redfin** simultaneously
-- Aggregates the properties in a Pandas DataFrame
+## HomeHarvest Features
+- **Source**: Fetches properties directly from **Realtor.com**.
+- **Data Format**: Structures data to resemble MLS listings.
+- **Export Flexibility**: Options to save as either CSV or Excel.
+- **Usage Modes**:
+  - **CLI**: For users who prefer command-line operations.
+  - **Python**: For those who'd like to integrate scraping into their Python scripts.
 [Video Guide for HomeHarvest](https://youtu.be/JnV7eR2Ve2o) - _updated for release v0.2.7_
@@ -29,21 +33,6 @@ pip install homeharvest
 ## Usage
-### Python
-```py
-from homeharvest import scrape_property
-import pandas as pd
-properties: pd.DataFrame = scrape_property(
-    location="85281",
-    listing_type="for_rent"  # for_sale / sold
-)
-#: Note, to export to CSV or Excel, use properties.to_csv() or properties.to_excel().
-print(properties)
-```
 ### CLI
 ```
@@ -55,7 +44,6 @@ positional arguments:
   location              Location to scrape (e.g., San Francisco, CA)
 options:
-  -h, --help            show this help message and exit
   -l {for_sale,for_rent,sold}, --listing_type {for_sale,for_rent,sold}
                         Listing type to scrape
   -o {excel,csv}, --output {excel,csv}
@@ -72,104 +60,107 @@ options:
 > homeharvest "San Francisco, CA" -l for_rent -o excel -f HomeHarvest
 ```
-## Output
-```py
->>> properties.head()
-                                        property_url site_name listing_type  apt_min_price  apt_max_price ...
-0  https://www.redfin.com/AZ/Tempe/1003-W-Washing...    redfin     for_rent         1666.0         2750.0 ...
-1  https://www.redfin.com/AZ/Tempe/VELA-at-Town-L...    redfin     for_rent         1665.0         3763.0 ...
-2  https://www.redfin.com/AZ/Tempe/Camden-Tempe/a...    redfin     for_rent         1939.0         3109.0 ...
-3  https://www.redfin.com/AZ/Tempe/Emerson-Park/a...    redfin     for_rent         1185.0         1817.0 ...
-4  https://www.redfin.com/AZ/Tempe/Rio-Paradiso-A...    redfin     for_rent         1470.0         2235.0 ...
-[5 rows x 41 columns]
-```
-### Parameters for `scrape_properties()`
+### Python
+```py
+from homeharvest import scrape_property
+from datetime import datetime
+
+# Generate filename based on current timestamp
+current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+filename = f"output/{current_timestamp}.csv"
+
+properties = scrape_property(
+    location="San Diego, CA",
+    listing_type="sold",  # for_sale, for_rent
+)
+print(f"Number of properties: {len(properties)}")
+properties.to_csv(filename, index=False)
+```
+## Output
 ```plaintext
+>>> properties.head()
+    MLS       MLS #  Status          Style  ...     COEDate  LotSFApx  PrcSqft Stories
+0  SDCA   230018348    SOLD         CONDOS  ...  2023-10-03    290110      803       2
+1  SDCA   230016614    SOLD      TOWNHOMES  ...  2023-10-03      None      838       3
+2  SDCA   230016367    SOLD         CONDOS  ...  2023-10-03     30056      649       1
+3  MRCA  NDP2306335    SOLD  SINGLE_FAMILY  ...  2023-10-03      7519      661       2
+4  SDCA   230014532    SOLD         CONDOS  ...  2023-10-03      None      752       1
+[5 rows x 22 columns]
+```
+### Parameters for `scrape_property()`
+```
 Required
 ├── location (str): address in various formats e.g. just zip, full address, city/state, etc.
 └── listing_type (enum): for_rent, for_sale, sold
 Optional
-├── site_name (list[enum], default=all three sites): zillow, realtor.com, redfin
-├── proxy (str): in format 'http://user:pass@host:port' or [https, socks]
-└── keep_duplicates (bool, default=False): whether to keep or remove duplicate properties based on address
+├── radius_for_comps (float): Radius in miles to find comparable properties based on individual addresses.
+├── sold_last_x_days (int): Number of past days to filter sold properties.
+├── proxy (str): in format 'http://user:pass@host:port'
 ```
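For illustration, a minimal sketch of the optional filters described above, using the keyword names from this commit's `scrape_property` signature (`radius`, `sold_last_x_days`); the README block above refers to the radius option as `radius_for_comps`:

```py
from homeharvest import scrape_property

# Comparable sold properties within 0.5 miles of a single address,
# limited to sales from the last 30 days
properties = scrape_property(
    location="2530 Al Lipscomb Way",
    listing_type="sold",
    radius=0.5,
    sold_last_x_days=30,
)
print(f"Number of properties: {len(properties)}")
```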
 ### Property Schema
 ```plaintext
 Property
 ├── Basic Information:
 │ ├── property_url (str)
-│ ├── site_name (enum): zillow, redfin, realtor.com
-│ ├── listing_type (enum): for_sale, for_rent, sold
-│ └── property_type (enum): house, apartment, condo, townhouse, single_family, multi_family, building
+│ ├── mls (str)
+│ ├── mls_id (str)
+│ └── status (str)
 ├── Address Details:
-│ ├── street_address (str)
-│ ├── city (str)
-│ ├── state (str)
-│ ├── zip_code (str)
-│ ├── unit (str)
-│ └── country (str)
+│ ├── street (str)
+│ ├── unit (str)
+│ ├── city (str)
+│ ├── state (str)
+│ └── zip (str)
-├── House for Sale Features:
-│ ├── tax_assessed_value (int)
-│ ├── lot_area_value (float)
-│ ├── lot_area_unit (str)
-│ ├── stories (int)
-│ ├── year_built (int)
-│ └── price_per_sqft (int)
+├── Property Description:
+│ ├── style (str)
+│ ├── beds (int)
+│ ├── baths_full (int)
+│ ├── baths_half (int)
+│ ├── sqft (int)
+│ ├── lot_sqft (int)
+│ ├── sold_price (int)
+│ ├── year_built (int)
+│ ├── garage (float)
+│ └── stories (int)
-├── Building for Sale and Apartment Details:
-│ ├── bldg_name (str)
-│ ├── beds_min (int)
-│ ├── beds_max (int)
-│ ├── baths_min (float)
-│ ├── baths_max (float)
-│ ├── sqft_min (int)
-│ ├── sqft_max (int)
-│ ├── price_min (int)
-│ ├── price_max (int)
-│ ├── area_min (int)
-│ └── unit_count (int)
+├── Property Listing Details:
+│ ├── list_price (int)
+│ ├── list_date (str)
+│ ├── last_sold_date (str)
+│ ├── prc_sqft (int)
+│ └── hoa_fee (int)
-├── Miscellaneous Details:
-│ ├── mls_id (str)
-│ ├── agent_name (str)
-│ ├── img_src (str)
-│ ├── description (str)
-│ ├── status_text (str)
-│ └── posted_time (str)
-└── Location Details:
-    ├── latitude (float)
-    └── longitude (float)
+├── Location Details:
+│ ├── latitude (float)
+│ ├── longitude (float)
+│ └── neighborhoods (str)
 ```
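As a rough usage sketch (column names are taken from the MLS-style output and the `process_result` mapping in this commit; exact columns may vary by listing type):

```py
from homeharvest import scrape_property

properties = scrape_property(location="San Diego, CA", listing_type="sold")

# Filter on the MLS-style columns shown in the output above
condos = properties[properties["Style"] == "CONDOS"]
print(condos[["MLS #", "Status", "PrcSqft", "YrBlt", "Stories"]].head())
```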
 ## Supported Countries for Property Scraping
-* **Zillow**: contains listings in the **US** & **Canada**
 * **Realtor.com**: mainly from the **US** but also has international listings
-* **Redfin**: listings mainly in the **US**, **Canada**, & has expanded to some areas in **Mexico**
 ### Exceptions
 The following exceptions may be raised when using HomeHarvest:
-- `InvalidSite` - valid options: `zillow`, `redfin`, `realtor.com`
 - `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold`
 - `NoResultsFound` - no properties found from your input
-- `GeoCoordsNotFound` - if Zillow scraper is not able to derive geo-coordinates from the location you input
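A minimal sketch of catching the remaining exceptions, assuming they are importable from `homeharvest.exceptions` as in this commit:

```py
from homeharvest import scrape_property
from homeharvest.exceptions import InvalidListingType, NoResultsFound

try:
    properties = scrape_property(location="85281", listing_type="for_rent")
except InvalidListingType as e:
    print(f"Bad listing type: {e}")  # must be for_sale, for_rent, or sold
except NoResultsFound as e:
    print(f"No properties found: {e}")
```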
 ## Frequently Asked Questions
 ---
-**Q: Encountering issues with your queries?**
-**A:** Try a single site and/or broaden the location. If problems persist, [submit an issue](https://github.com/ZacharyHampton/HomeHarvest/issues).
+**Q: Encountering issues with your searches?**
+**A:** Try to broaden the location. If problems persist, [submit an issue](https://github.com/ZacharyHampton/HomeHarvest/issues).
 ---
 **Q: Received a Forbidden 403 response code?**
-**A:** This indicates that you have been blocked by the real estate site for sending too many requests. Currently, **Zillow** is particularly aggressive with blocking. We recommend:
+**A:** This indicates that you have been blocked by Realtor.com for sending too many requests. We recommend:
 - Waiting a few seconds between requests.
 - Trying a VPN to change your IP address.
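If 403s persist, the `proxy` parameter documented above can route requests through another IP; a minimal sketch (the proxy URL is a placeholder):

```py
from homeharvest import scrape_property

properties = scrape_property(
    location="San Francisco, CA",
    listing_type="for_rent",
    proxy="http://user:pass@proxy.example.com:8080",  # placeholder credentials/host
)
```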

homeharvest/__init__.py

@@ -1,5 +1,4 @@
 import pandas as pd
-from typing import Union
 import concurrent.futures
 from concurrent.futures import ThreadPoolExecutor
@@ -7,7 +6,7 @@ from .core.scrapers import ScraperInput
 from .utils import process_result, ordered_properties
 from .core.scrapers.realtor import RealtorScraper
 from .core.scrapers.models import ListingType, Property, SiteName
-from .exceptions import InvalidSite, InvalidListingType
+from .exceptions import InvalidListingType
 _scrapers = {
@@ -15,10 +14,7 @@
 }
-def _validate_input(site_name: str, listing_type: str) -> None:
-    if site_name.lower() not in _scrapers:
-        raise InvalidSite(f"Provided site, '{site_name}', does not exist.")
+def _validate_input(listing_type: str) -> None:
     if listing_type.upper() not in ListingType.__members__:
         raise InvalidListingType(f"Provided listing type, '{listing_type}', does not exist.")
@@ -27,7 +23,7 @@ def _scrape_single_site(location: str, site_name: str, listing_type: str, radius
     """
     Helper function to scrape a single site.
     """
-    _validate_input(site_name, listing_type)
+    _validate_input(listing_type)
     scraper_input = ScraperInput(
         location=location,
@@ -40,6 +36,7 @@ def _scrape_single_site(location: str, site_name: str, listing_type: str, radius
     site = _scrapers[site_name.lower()](scraper_input)
     results = site.search()
+    print(f"found {len(results)}")
     properties_dfs = [process_result(result) for result in results]
     if not properties_dfs:
@@ -50,22 +47,19 @@ def _scrape_single_site(location: str, site_name: str, listing_type: str, radius
 def scrape_property(
     location: str,
-    #: site_name: Union[str, list[str]] = "realtor.com",
     listing_type: str = "for_sale",
     radius: float = None,
     sold_last_x_days: int = None,
     proxy: str = None,
 ) -> pd.DataFrame:
     """
-    Scrape property from various sites from a given location and listing type.
-    :param sold_last_x_days: Sold in last x days
-    :param radius: Radius in miles to find comparable properties on individual addresses
-    :param keep_duplicates:
-    :param proxy:
+    Scrape properties from Realtor.com based on a given location and listing type.
     :param location: US Location (e.g. 'San Francisco, CA', 'Cook County, IL', '85281', '2530 Al Lipscomb Way')
-    :param site_name: Site name or list of site names (e.g. ['realtor.com', 'zillow'], 'redfin')
-    :param listing_type: Listing type (e.g. 'for_sale', 'for_rent', 'sold')
+    :param listing_type: Listing type (e.g. 'for_sale', 'for_rent', 'sold'). Default is 'for_sale'.
+    :param radius: Radius in miles to find comparable properties on individual addresses. Optional.
+    :param sold_last_x_days: Number of past days to filter sold properties. Optional.
+    :param proxy: Proxy IP address to be used for scraping. Optional.
     :returns: pd.DataFrame containing properties
     """
     site_name = "realtor.com"

homeharvest/cli.py

@@ -38,7 +38,8 @@ def main():
     parser.add_argument(
         "-r",
-        "--radius",
+        "--sold-properties-radius",
+        dest="sold_properties_radius",  # This makes sure the parsed argument is stored as radius_for_comps in args
         type=float,
         default=None,
         help="Get comparable properties within _ (eg. 0.0) miles. Only applicable for individual addresses."
@@ -46,7 +47,7 @@ def main():
     args = parser.parse_args()
-    result = scrape_property(args.location, args.listing_type, proxy=args.proxy)
+    result = scrape_property(args.location, args.listing_type, radius_for_comps=args.radius_for_comps, proxy=args.proxy)
     if not args.filename:
         timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

homeharvest/core/scrapers/models.py

@@ -32,39 +32,34 @@ class Address:
 @dataclass
-class Agent:
-    name: str
-    phone: str | None = None
-    email: str | None = None
+class Description:
+    style: str | None = None
+    beds: int | None = None
+    baths_full: int | None = None
+    baths_half: int | None = None
+    sqft: int | None = None
+    lot_sqft: int | None = None
+    sold_price: int | None = None
+    year_built: int | None = None
+    garage: float | None = None
+    stories: int | None = None
 @dataclass
 class Property:
-    property_url: str | None = None
+    property_url: str
     mls: str | None = None
     mls_id: str | None = None
     status: str | None = None
-    style: str | None = None
-    beds: int | None = None
-    baths_full: int | None = None
-    baths_half: int | None = None
-    list_price: int | None = None
-    list_date: str | None = None
-    sold_price: int | None = None
-    last_sold_date: str | None = None
-    prc_sqft: float | None = None
-    est_sf: int | None = None
-    lot_sf: int | None = None
-    hoa_fee: int | None = None
     address: Address | None = None
-    yr_blt: int | None = None
+    list_price: int | None = None
+    list_date: str | None = None
+    last_sold_date: str | None = None
+    prc_sqft: int | None = None
+    hoa_fee: int | None = None
+    description: Description | None = None
     latitude: float | None = None
     longitude: float | None = None
-    stories: int | None = None
-    prkg_gar: float | None = None
     neighborhoods: Optional[str] = None
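For context, a minimal sketch of how the reworked dataclasses nest after this change (field values are illustrative; the import path and keyword names follow this commit's package layout):

```py
from homeharvest.core.scrapers.models import Property, Address, Description

prop = Property(
    property_url="https://www.realtor.com/realestateandhomes-detail/example-permalink",
    mls="SDCA",
    mls_id="230018348",
    status="SOLD",
    address=Address(street="123 Main St", unit=None, city="San Diego", state="CA", zip="92101"),
    description=Description(style="CONDOS", beds=2, baths_full=2, sqft=1100, year_built=2005),
    list_price=750000,
)
# Listing-level fields stay on Property; physical attributes live on Description
print(prop.description.beds, prop.address.city)
```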

homeharvest/core/scrapers/realtor/__init__.py

@@ -2,38 +2,26 @@
 homeharvest.realtor.__init__
 ~~~~~~~~~~~~
-This module implements the scraper for relator.com
+This module implements the scraper for realtor.com
 """
-from ..models import Property, Address, ListingType
+from typing import Dict, Union, Optional
+from concurrent.futures import ThreadPoolExecutor, as_completed
 from .. import Scraper
 from ....exceptions import NoResultsFound
-from concurrent.futures import ThreadPoolExecutor, as_completed
+from ..models import Property, Address, ListingType, Description
 class RealtorScraper(Scraper):
+    SEARCH_URL = "https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta"
+    PROPERTY_URL = "https://www.realtor.com/realestateandhomes-detail/"
+    ADDRESS_AUTOCOMPLETE_URL = "https://parser-external.geo.moveaws.com/suggest"
     def __init__(self, scraper_input):
         self.counter = 1
         super().__init__(scraper_input)
-        self.search_url = (
-            "https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta"
-        )
     def handle_location(self):
-        headers = {
-            "authority": "parser-external.geo.moveaws.com",
-            "accept": "*/*",
-            "accept-language": "en-US,en;q=0.9",
-            "origin": "https://www.realtor.com",
-            "referer": "https://www.realtor.com/",
-            "sec-ch-ua": '"Chromium";v="116", "Not)A;Brand";v="24", "Google Chrome";v="116"',
-            "sec-ch-ua-mobile": "?0",
-            "sec-ch-ua-platform": '"Windows"',
-            "sec-fetch-dest": "empty",
-            "sec-fetch-mode": "cors",
-            "sec-fetch-site": "cross-site",
-            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
-        }
         params = {
             "input": self.location,
             "client_id": self.listing_type.value.lower().replace("_", "-"),
@@ -42,9 +30,8 @@ class RealtorScraper(Scraper):
         }
         response = self.session.get(
-            "https://parser-external.geo.moveaws.com/suggest",
+            self.ADDRESS_AUTOCOMPLETE_URL,
             params=params,
-            headers=headers,
         )
         response_json = response.json()
@@ -70,22 +57,19 @@ class RealtorScraper(Scraper):
                     stories
                 }
                 address {
-                    address_validation_code
-                    city
-                    country
-                    county
-                    line
-                    postal_code
-                    state_code
-                    street_direction
-                    street_name
                     street_number
+                    street_name
                     street_suffix
-                    street_post_direction
-                    unit_value
                     unit
-                    unit_descriptor
-                    zip
+                    city
+                    state_code
+                    postal_code
+                    location {
+                        coordinate {
+                            lat
+                            lon
+                        }
+                    }
                 }
                 basic {
                     baths
@@ -113,25 +97,24 @@ class RealtorScraper(Scraper):
             "variables": variables,
         }
-        response = self.session.post(self.search_url, json=payload)
+        response = self.session.post(self.SEARCH_URL, json=payload)
         response_json = response.json()
         property_info = response_json["data"]["property"]
         return [
             Property(
-                property_url="https://www.realtor.com/realestateandhomes-detail/"
-                + property_info["details"]["permalink"],
-                stories=property_info["details"]["stories"],
                 mls_id=property_id,
+                property_url=f"{self.PROPERTY_URL}{property_info['details']['permalink']}",
+                address=self._parse_address(property_info, search_type="handle_address"),
+                description=self._parse_description(property_info)
             )
         ]
-    def general_search(self, variables: dict, search_type: str, return_total: bool = False) -> list[Property] | int:
+    def general_search(self, variables: dict, search_type: str) -> Dict[str, Union[int, list[Property]]]:
         """
         Handles a location area & returns a list of properties
         """
         results_query = """{
             count
             total
@@ -141,86 +124,87 @@ class RealtorScraper(Scraper):
             status
             last_sold_price
             last_sold_date
-            hoa {
-                fee
-            }
+            list_price
+            price_per_sqft
             description {
+                sqft
+                beds
                 baths_full
                 baths_half
-                beds
                 lot_sqft
-                sqft
                 sold_price
                 year_built
                 garage
                 sold_price
                 type
-                sub_type
                 name
                 stories
             }
             source {
-                raw {
-                    area
-                    status
-                    style
-                }
-                last_update_date
-                contract_date
                 id
                 listing_id
-                name
-                type
-                listing_href
-                community_id
-                management_id
-                corporation_id
-                subdivision_status
-                spec_id
-                plan_id
-                tier_rank
-                feed_type
+            }
+            hoa {
+                fee
             }
             location {
                 address {
+                    street_number
+                    street_name
+                    street_suffix
+                    unit
                     city
-                    country
-                    line
-                    postal_code
                     state_code
-                    state
+                    postal_code
                     coordinate {
                         lon
                         lat
                     }
-                    street_direction
-                    street_name
-                    street_number
-                    street_post_direction
-                    street_suffix
-                    unit
                 }
                 neighborhoods {
                     name
                 }
             }
-            list_price
-            price_per_sqft
-            style_category_tags {
-                exterior
-            }
-            source {
-                id
-            }
         }
     }
 }"""
         sold_date_param = ('sold_date: { min: "$today-%sD" }' % self.sold_last_x_days
-                           if self.listing_type == ListingType.SOLD and self.sold_last_x_days is not None
+                           if self.listing_type == ListingType.SOLD and self.sold_last_x_days
                            else "")
+        sort_param = ('sort: [{ field: sold_date, direction: desc }]'
+                      if self.listing_type == ListingType.SOLD
+                      else 'sort: [{ field: list_date, direction: desc }]')
-        if search_type == "area":
+        if search_type == "comps":
+            print('general - comps')
+            query = (
+                """query Property_search(
+                    $coordinates: [Float]!
+                    $radius: String!
+                    $offset: Int!,
+                ) {
+                    property_search(
+                        query: {
+                            nearby: {
+                                coordinates: $coordinates
+                                radius: $radius
+                            }
+                            status: %s
+                            %s
+                        }
+                        %s
+                        limit: 200
+                        offset: $offset
+                    ) %s""" % (
+                    self.listing_type.value.lower(),
+                    sold_date_param,
+                    sort_param,
+                    results_query
+                )
+            )
+        else:
+            print('general - not comps')
             query = (
                 """query Home_search(
                     $city: String,
@@ -238,60 +222,27 @@ class RealtorScraper(Scraper):
                             status: %s
                             %s
                         }
+                        %s
                         limit: 200
                         offset: $offset
                     ) %s"""
                 % (
                     self.listing_type.value.lower(),
                     sold_date_param,
+                    sort_param,
                     results_query
                 )
             )
-        elif search_type == "comp_address":
-            query = (
-                """query Property_search(
-                    $coordinates: [Float]!
-                    $radius: String!
-                    $offset: Int!,
-                ) {
-                    property_search(
-                        query: {
-                            nearby: {
-                                coordinates: $coordinates
-                                radius: $radius
-                            }
-                            %s
-                        }
-                        limit: 200
-                        offset: $offset
-                    ) %s""" % (sold_date_param, results_query))
-        else:
-            query = (
-                """query Property_search(
-                    $property_id: [ID]!
-                    $offset: Int!,
-                ) {
-                    property_search(
-                        query: {
-                            property_id: $property_id
-                            %s
-                        }
-                        limit: 200
-                        offset: $offset
-                    ) %s""" % (sold_date_param, results_query))
         payload = {
             "query": query,
             "variables": variables,
         }
-        response = self.session.post(self.search_url, json=payload)
+        response = self.session.post(self.SEARCH_URL, json=payload)
         response.raise_for_status()
         response_json = response.json()
-        search_key = "home_search" if search_type == "area" else "property_search"
+        search_key = "property_search" if search_type == "comps" else "home_search"
-        if return_total:
-            return response_json["data"][search_key]["total"]
         properties: list[Property] = []
@@ -303,7 +254,7 @@ class RealtorScraper(Scraper):
             or response_json["data"][search_key] is None
             or "results" not in response_json["data"][search_key]
         ):
-            return []
+            return {"total": 0, "properties": []}
         for result in response_json["data"][search_key]["results"]:
             self.counter += 1
@@ -312,16 +263,90 @@ class RealtorScraper(Scraper):
                 if "source" in result and isinstance(result["source"], dict)
                 else None
             )
-            mls_id = (
-                result["source"].get("listing_id")
-                if "source" in result and isinstance(result["source"], dict)
-            else None
-            )
-            if not mls_id:
+            if not mls:
                 continue
+            # not type
+            able_to_get_lat_long = result and result.get("location") and result["location"].get("address") and result["location"]["address"].get("coordinate")
+            realty_property = Property(
+                mls=mls,
+                mls_id=result["source"].get("listing_id") if "source" in result and isinstance(result["source"], dict) else None,
+                property_url=f"{self.PROPERTY_URL}{result['property_id']}",
+                status=result["status"].upper(),
+                list_price=result["list_price"],
+                list_date=result["list_date"].split("T")[0] if result.get("list_date") else None,
+                prc_sqft=result.get("price_per_sqft"),
+                last_sold_date=result.get("last_sold_date"),
+                hoa_fee=result["hoa"]["fee"] if result.get("hoa") and isinstance(result["hoa"], dict) else None,
+                latitude=result["location"]["address"]["coordinate"].get("lat") if able_to_get_lat_long else None,
+                longitude=result["location"]["address"]["coordinate"].get("lon") if able_to_get_lat_long else None,
+                address=self._parse_address(result, search_type="general_search"),
+                neighborhoods=self._parse_neighborhoods(result),
+                description=self._parse_description(result)
+            )
+            properties.append(realty_property)
+        # print(response_json["data"]["property_search"], variables["offset"])
+        # print(response_json["data"]["home_search"]["total"], variables["offset"])
+        return {
+            "total": response_json["data"][search_key]["total"],
+            "properties": properties,
+        }
+    def search(self):
+        location_info = self.handle_location()
+        location_type = location_info["area_type"]
+        search_variables = {
+            "offset": 0,
+        }
+        search_type = "comps" if self.radius and location_type == "address" else "area"
+        print(search_type)
+        if location_type == "address":
+            if not self.radius:  #: single address search, non comps
+                property_id = location_info["mpr_id"]
+                search_variables |= {"property_id": property_id}
+                return self.handle_address(property_id)
+            else:  #: general search, comps (radius)
+                coordinates = list(location_info["centroid"].values())
+                search_variables |= {
+                    "coordinates": coordinates,
+                    "radius": "{}mi".format(self.radius),
+                }
+        else:  #: general search, location
+            search_variables |= {
+                "city": location_info.get("city"),
+                "county": location_info.get("county"),
+                "state_code": location_info.get("state_code"),
+                "postal_code": location_info.get("postal_code"),
+            }
+        result = self.general_search(search_variables, search_type=search_type)
+        total = result["total"]
+        homes = result["properties"]
+        with ThreadPoolExecutor(max_workers=10) as executor:
+            futures = [
+                executor.submit(
+                    self.general_search,
+                    variables=search_variables | {"offset": i},
+                    search_type=search_type,
+                )
+                for i in range(200, min(total, 10000), 200)
+            ]
+            for future in as_completed(futures):
+                homes.extend(future.result()["properties"])
+        return homes
+    @staticmethod
+    def _parse_neighborhoods(result: dict) -> Optional[str]:
         neighborhoods_list = []
         neighborhoods = result["location"].get("neighborhoods", [])
@@ -331,103 +356,38 @@ class RealtorScraper(Scraper):
             if name:
                 neighborhoods_list.append(name)
-            neighborhoods_str = (
-                ", ".join(neighborhoods_list) if neighborhoods_list else None
-            )
-            able_to_get_lat_long = result and result.get("location") and result["location"].get("address") and result["location"]["address"].get("coordinate")
-            realty_property = Property(
-                property_url="https://www.realtor.com/realestateandhomes-detail/"
-                + result["property_id"],
-                mls=mls,
-                mls_id=mls_id,
-                status=result["status"].upper(),
-                style=result["description"]["type"].upper(),
-                beds=result["description"]["beds"],
-                baths_full=result["description"]["baths_full"],
-                baths_half=result["description"]["baths_half"],
-                est_sf=result["description"]["sqft"],
-                lot_sf=result["description"]["lot_sqft"],
-                list_price=result["list_price"],
-                list_date=result["list_date"].split("T")[0]
-                if result["list_date"]
-                else None,
-                sold_price=result["description"]["sold_price"],
-                prc_sqft=result["price_per_sqft"],
-                last_sold_date=result["last_sold_date"],
-                hoa_fee=result["hoa"]["fee"] if result.get("hoa") and isinstance(result["hoa"], dict) else None,
-                address=Address(
-                    street=f"{result['location']['address']['street_number']} {result['location']['address']['street_name']} {result['location']['address']['street_suffix']}",
-                    unit=result["location"]["address"]["unit"],
-                    city=result["location"]["address"]["city"],
-                    state=result["location"]["address"]["state_code"],
-                    zip=result["location"]["address"]["postal_code"],
-                ),
-                yr_blt=result["description"]["year_built"],
-                latitude=result["location"]["address"]["coordinate"].get("lat") if able_to_get_lat_long else None,
-                longitude=result["location"]["address"]["coordinate"].get("lon") if able_to_get_lat_long else None,
-                prkg_gar=result["description"]["garage"],
-                stories=result["description"]["stories"],
-                neighborhoods=neighborhoods_str,
-            )
-            properties.append(realty_property)
-        return properties
-    def search(self):
-        location_info = self.handle_location()
-        location_type = location_info["area_type"]
-        is_for_comps = self.radius is not None and location_type == "address"
-        offset = 0
-        search_variables = {
-            "offset": offset,
-        }
-        search_type = "comp_address" if is_for_comps \
-            else "address" if location_type == "address" and not is_for_comps \
-            else "area"
-        if location_type == "address" and not is_for_comps:  #: single address search, non comps
-            property_id = location_info["mpr_id"]
-            search_variables = search_variables | {"property_id": property_id}
-            general_search = self.general_search(search_variables, search_type)
-            if general_search:
-                return general_search
-            else:
-                return self.handle_address(property_id)  #: TODO: support single address search for query by property address (can go from property -> listing to get better data)
-        elif not is_for_comps:  #: area search
-            search_variables = search_variables | {
-                "city": location_info.get("city"),
-                "county": location_info.get("county"),
-                "state_code": location_info.get("state_code"),
-                "postal_code": location_info.get("postal_code"),
-            }
-        else:  #: comps search
-            coordinates = list(location_info["centroid"].values())
-            search_variables = search_variables | {
-                "coordinates": coordinates,
-                "radius": "{}mi".format(self.radius),
-            }
-        total = self.general_search(search_variables, return_total=True, search_type=search_type)
-        homes = []
-        with ThreadPoolExecutor(max_workers=10) as executor:
-            futures = [
-                executor.submit(
-                    self.general_search,
-                    variables=search_variables | {"offset": i},
-                    return_total=False,
-                    search_type=search_type,
-                )
-                for i in range(0, total, 200)
-            ]
-            for future in as_completed(futures):
-                homes.extend(future.result())
-        return homes
+        return ", ".join(neighborhoods_list) if neighborhoods_list else None
+    @staticmethod
+    def _parse_address(result: dict, search_type):
+        if search_type == "general_search":
+            return Address(
+                street=f"{result['location']['address']['street_number']} {result['location']['address']['street_name']} {result['location']['address']['street_suffix']}",
+                unit=result["location"]["address"]["unit"],
+                city=result["location"]["address"]["city"],
+                state=result["location"]["address"]["state_code"],
+                zip=result["location"]["address"]["postal_code"],
+            )
+        return Address(
+            street=f"{result['address']['street_number']} {result['address']['street_name']} {result['address']['street_suffix']}",
+            unit=result['address']['unit'],
+            city=result['address']['city'],
+            state=result['address']['state_code'],
+            zip=result['address']['postal_code'],
+        )
+    @staticmethod
+    def _parse_description(result: dict) -> Description:
+        description_data = result.get("description", {})
+        return Description(
+            style=description_data.get("type", "").upper(),
+            beds=description_data.get("beds"),
+            baths_full=description_data.get("baths_full"),
+            baths_half=description_data.get("baths_half"),
+            sqft=description_data.get("sqft"),
+            lot_sqft=description_data.get("lot_sqft"),
+            sold_price=description_data.get("sold_price"),
+            year_built=description_data.get("year_built"),
+            garage=description_data.get("garage"),
+            stories=description_data.get("stories"),
+        )

homeharvest/exceptions.py

@@ -1,18 +1,6 @@
-class InvalidSite(Exception):
-    """Raised when a provided site is does not exist."""
 class InvalidListingType(Exception):
     """Raised when a provided listing type is does not exist."""
 class NoResultsFound(Exception):
     """Raised when no results are found for the given location"""
-class GeoCoordsNotFound(Exception):
-    """Raised when no property is found for the given address"""
-class SearchTooBroad(Exception):
-    """Raised when the search is too broad"""

homeharvest/utils.py

@@ -39,7 +39,6 @@ def process_result(result: Property) -> pd.DataFrame:
     prop_data["MLS"] = prop_data["mls"]
     prop_data["MLS #"] = prop_data["mls_id"]
     prop_data["Status"] = prop_data["status"]
-    prop_data["Style"] = prop_data["style"]
     if "address" in prop_data:
         address_data = prop_data["address"]
@@ -49,26 +48,27 @@ def process_result(result: Property) -> pd.DataFrame:
         prop_data["State"] = address_data.state
         prop_data["Zip"] = address_data.zip
-    prop_data["Community"] = prop_data["neighborhoods"]
-    prop_data["Beds"] = prop_data["beds"]
-    prop_data["FB"] = prop_data["baths_full"]
-    prop_data["NumHB"] = prop_data["baths_half"]
-    prop_data["EstSF"] = prop_data["est_sf"]
     prop_data["ListPrice"] = prop_data["list_price"]
     prop_data["Lst Date"] = prop_data["list_date"]
-    prop_data["Sold Price"] = prop_data["sold_price"]
     prop_data["COEDate"] = prop_data["last_sold_date"]
-    prop_data["LotSFApx"] = prop_data["lot_sf"]
+    prop_data["PrcSqft"] = prop_data["prc_sqft"]
     prop_data["HOAFee"] = prop_data["hoa_fee"]
-    if prop_data.get("prc_sqft") is not None:
-        prop_data["PrcSqft"] = round(prop_data["prc_sqft"], 2)
+    description = result.description
+    prop_data["Style"] = description.style
+    prop_data["Beds"] = description.beds
+    prop_data["FB"] = description.baths_full
+    prop_data["NumHB"] = description.baths_half
+    prop_data["EstSF"] = description.sqft
+    prop_data["LotSFApx"] = description.lot_sqft
+    prop_data["Sold Price"] = description.sold_price
+    prop_data["YrBlt"] = description.year_built
+    prop_data["PrkgGar"] = description.garage
+    prop_data["Stories"] = description.stories
-    prop_data["YrBlt"] = prop_data["yr_blt"]
     prop_data["LATITUDE"] = prop_data["latitude"]
     prop_data["LONGITUDE"] = prop_data["longitude"]
-    prop_data["Stories"] = prop_data["stories"]
+    prop_data["Community"] = prop_data["neighborhoods"]
-    prop_data["PrkgGar"] = prop_data["prkg_gar"]
     properties_df = pd.DataFrame([prop_data])
     properties_df = properties_df.reindex(columns=ordered_properties)