Compare commits

31 Commits

Author SHA1 Message Date
Zachary Hampton
2d75ca4dfa Merge pull request #131 from ZacharyHampton/feature/data-additions
Feature/data additions
2025-07-15 13:56:16 -07:00
Zachary Hampton
ca1be85a93 - delete test 2025-07-15 13:55:40 -07:00
Zachary Hampton
145c337b55 - data quality and clean up code 2025-07-15 13:51:47 -07:00
Zachary Hampton
6c6243eba4 - add all new data fields 2025-07-15 13:21:48 -07:00
Zachary Hampton
79082090cb - pydantic conversion 2025-07-15 12:25:43 -07:00
Zachary Hampton
8311f4dfbc - data additions 2025-07-15 12:00:19 -07:00
Zachary Hampton
0d85100091 - update dependencies 2025-07-14 17:08:27 -07:00
Zachary Hampton
851ba53d81 Merge pull request #128 from Alexandre-Shofstall/fix/python39-compat
Fix syntax of __init__ line 24
2025-07-03 10:28:49 -07:00
Zachary Hampton
0fdc309262 Update pyproject.toml 2025-07-03 10:28:14 -07:00
Alexandre Shofstall
62b6726d42 Fix syntax of __init__ line 24 2025-07-03 19:20:49 +02:00
Zachary Hampton
ccf5786ce2 Merge pull request #127 from Alexandre-Shofstall/fix/python39-compat
Fix typing syntax for Python 3.9 compatibility in __init__.py
2025-07-03 09:43:26 -07:00
Zachary Hampton
b4f05b254a Update pyproject.toml 2025-07-03 09:43:10 -07:00
Alexandre Shofstall
941d1081f7 Fix typing syntax for Python 3.9 compatibility in __init__.py 2025-07-03 18:11:18 +02:00
Zachary Hampton
c788b3318d Update README.md 2025-06-19 16:52:14 -07:00
zachary
68a3438c6e - single home return type bug fix 2025-05-05 12:29:36 -07:00
zachary
a3c5e9060e - updated queries 2025-05-03 13:55:56 -07:00
zachary
d06595fe56 - updated queries 2025-05-03 13:28:12 -07:00
zachary
e378feeefe - bug fixes 2025-04-12 18:34:35 -07:00
zachary
8a5683fe79 - return type parameter
- optimized get extra fields with query clustering
2025-04-12 17:55:52 -07:00
Zachary Hampton
65f799a27d Update README.md 2025-02-21 13:33:32 -07:00
Cullen Watson
0de916e590 enh:tax history 2025-01-06 05:28:36 -06:00
Cullen Watson
6a3f7df087 chore:yml 2024-11-05 23:55:59 -06:00
Cullen Watson
a75bcc2aa0 docs:readme 2024-11-04 10:22:32 -06:00
Cullen Watson
1082b86fa1 docs:readme 2024-11-03 17:23:58 -06:00
Cullen Watson
8e04f6b117 enh: property type (#102) 2024-11-03 17:23:07 -06:00
Zachary Hampton
1f717bd9e3 - switch eps
- new hrefs
- property_id, listing_id data points
2024-09-06 15:49:07 -07:00
Zachary Hampton
8cfe056f79 - office mls set 2024-08-23 10:54:43 -07:00
Zachary Hampton
1010c743b6 - agent mls set and nrds id 2024-08-23 10:47:45 -07:00
Zachary Hampton
32fdc281e3 - rewrote & optimized flow
- new_construction data point
- renamed "agent" & "broker" to "agent_name" & "broker_name"
- added builder & office data
- added entity uuids
2024-08-20 05:19:15 -07:00
Zachary Hampton
6d14b8df5a - fix limit parameter
- fix specific for_rent apartment listing prices
2024-08-13 10:44:11 -07:00
Zachary Hampton
3f44744d61 - primary photo bug fix
- limit parameter
2024-07-15 07:19:57 -07:00
17 changed files with 51245 additions and 1299 deletions

.github/FUNDING.yml (new file, 1 line)

@@ -0,0 +1 @@
github: Bunsly

@@ -2,18 +2,12 @@
**HomeHarvest** is a real estate scraping library that extracts and formats data in the style of MLS listings. **HomeHarvest** is a real estate scraping library that extracts and formats data in the style of MLS listings.
**Not technical?** Try out the web scraping tool on our site at [tryhomeharvest.com](https://tryhomeharvest.com).
*Looking to build a data-focused software product?* **[Book a call](https://bunsly.com)** *to work with us.*
## HomeHarvest Features ## HomeHarvest Features
- **Source**: Fetches properties directly from **Realtor.com**. - **Source**: Fetches properties directly from **Realtor.com**.
- **Data Format**: Structures data to resemble MLS listings. - **Data Format**: Structures data to resemble MLS listings.
- **Export Flexibility**: Options to save as either CSV or Excel. - **Export Flexibility**: Options to save as either CSV or Excel.
[Video Guide for HomeHarvest](https://youtu.be/J1qgNPgmSLI) - _updated for release v0.3.4_
![homeharvest](https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/b3d5d727-e67b-4a9f-85d8-1e65fd18620a) ![homeharvest](https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/b3d5d727-e67b-4a9f-85d8-1e65fd18620a)
## Installation ## Installation
@@ -40,6 +34,7 @@ properties = scrape_property(
listing_type="sold", # or (for_sale, for_rent, pending) listing_type="sold", # or (for_sale, for_rent, pending)
past_days=30, # sold in last 30 days - listed in last 30 days if (for_sale, for_rent) past_days=30, # sold in last 30 days - listed in last 30 days if (for_sale, for_rent)
# property_type=['single_family','multi_family'],
# date_from="2023-05-01", # alternative to past_days # date_from="2023-05-01", # alternative to past_days
# date_to="2023-05-28", # date_to="2023-05-28",
# foreclosure=True # foreclosure=True
@@ -68,13 +63,30 @@ print(properties.head())
``` ```
Required Required
├── location (str): The address in various formats - this could be just a zip code, a full address, or city/state, etc. ├── location (str): The address in various formats - this could be just a zip code, a full address, or city/state, etc.
── listing_type (option): Choose the type of listing. ── listing_type (option): Choose the type of listing.
- 'for_rent' - 'for_rent'
- 'for_sale' - 'for_sale'
- 'sold' - 'sold'
- 'pending' - 'pending' (for pending/contingent sales)
Optional Optional
├── property_type (list): Choose the type of properties.
- 'single_family'
- 'multi_family'
- 'condos'
- 'condo_townhome_rowhome_coop'
- 'condo_townhome'
- 'townhomes'
- 'duplex_triplex'
- 'farm'
- 'land'
- 'mobile'
├── return_type (option): Choose the return type.
│ - 'pandas' (default)
│ - 'pydantic'
│ - 'raw' (json)
├── radius (decimal): Radius in miles to find comparable properties based on individual addresses. ├── radius (decimal): Radius in miles to find comparable properties based on individual addresses.
│ Example: 5.5 (fetches properties within a 5.5-mile radius if location is set to a specific address; otherwise, ignored) │ Example: 5.5 (fetches properties within a 5.5-mile radius if location is set to a specific address; otherwise, ignored)
@@ -92,9 +104,11 @@ Optional
├── proxy (string): In format 'http://user:pass@host:port' ├── proxy (string): In format 'http://user:pass@host:port'
├── extra_property_data (True/False): Increases requests by O(n). If set, this fetches additional property data (e.g. agent, broker, property evaluations etc.) ├── extra_property_data (True/False): Increases requests by O(n). If set, this fetches additional property data for general searches (e.g. schools, tax appraisals etc.)
── exclude_pending (True/False): If set, excludes pending properties from the results unless listing_type is 'pending' ── exclude_pending (True/False): If set, excludes 'pending' properties from the 'for_sale' results unless listing_type is 'pending'
└── limit (integer): Limit the number of properties to fetch. Max & default is 10000.
``` ```
### Property Schema ### Property Schema
@@ -102,6 +116,8 @@ Optional
Property Property
├── Basic Information: ├── Basic Information:
│ ├── property_url │ ├── property_url
│ ├── property_id
│ ├── listing_id
│ ├── mls │ ├── mls
│ ├── mls_id │ ├── mls_id
│ └── status │ └── status
@@ -121,39 +137,60 @@ Property
│ ├── sqft │ ├── sqft
│ ├── year_built │ ├── year_built
│ ├── stories │ ├── stories
│ ├── garage
│ └── lot_sqft │ └── lot_sqft
├── Property Listing Details: ├── Property Listing Details:
│ ├── days_on_mls │ ├── days_on_mls
│ ├── list_price │ ├── list_price
│ ├── list_price_min
│ ├── list_price_max
│ ├── list_date │ ├── list_date
│ ├── pending_date │ ├── pending_date
│ ├── sold_price │ ├── sold_price
│ ├── last_sold_date │ ├── last_sold_date
│ ├── price_per_sqft │ ├── price_per_sqft
│ ├── parking_garage │ ├── new_construction
│ └── hoa_fee │ └── hoa_fee
├── Tax Information:
│ ├── year
│ ├── tax
│ ├── assessment
│ │ ├── building
│ │ ├── land
│ │ └── total
├── Location Details: ├── Location Details:
│ ├── latitude │ ├── latitude
│ ├── longitude │ ├── longitude
│ ├── nearby_schools │ ├── nearby_schools
├── Agent Info: ├── Agent Info:
│ ├── agent │ ├── agent_id
│ ├── agent_name
│ ├── agent_email │ ├── agent_email
│ └── agent_phone │ └── agent_phone
├── Broker Info: ├── Broker Info:
│ ├── broker │ ├── broker_id
── broker_email ── broker_name
│ └── broker_website
├── Builder Info:
│ ├── builder_id
│ └── builder_name
├── Office Info:
│ ├── office_id
│ ├── office_name
│ ├── office_phones
│ └── office_email
``` ```
### Exceptions ### Exceptions
The following exceptions may be raised when using HomeHarvest: The following exceptions may be raised when using HomeHarvest:
- `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold` - `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold`, `pending`.
- `InvalidDate` - date_from or date_to is not in the format YYYY-MM-DD. - `InvalidDate` - date_from or date_to is not in the format YYYY-MM-DD.
- `AuthenticationError` - Realtor.com token request failed. - `AuthenticationError` - Realtor.com token request failed.

@@ -1,141 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "cb48903e-5021-49fe-9688-45cd0bc05d0f",
"metadata": {
"is_executing": true
},
"outputs": [],
"source": [
"from homeharvest import scrape_property\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "156488ce-0d5f-43c5-87f4-c33e9c427860",
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.max_columns', None) # Show all columns\n",
"pd.set_option('display.max_rows', None) # Show all rows\n",
"pd.set_option('display.width', None) # Auto-adjust display width to fit console\n",
"pd.set_option('display.max_colwidth', 50) # Limit max column width to 50 characters"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1c8b9744-8606-4e9b-8add-b90371a249a7",
"metadata": {},
"outputs": [],
"source": [
"# check for sale properties\n",
"scrape_property(\n",
" location=\"dallas\",\n",
" listing_type=\"for_sale\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aaf86093",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# search a specific address\n",
"scrape_property(\n",
" location=\"2530 Al Lipscomb Way\",\n",
" listing_type=\"for_sale\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab7b4c21-da1d-4713-9df4-d7425d8ce21e",
"metadata": {},
"outputs": [],
"source": [
"# check rentals\n",
"scrape_property(\n",
" location=\"chicago, illinois\",\n",
" listing_type=\"for_rent\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af280cd3",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# check sold properties\n",
"properties = scrape_property(\n",
" location=\"90210\",\n",
" listing_type=\"sold\",\n",
" past_days=10\n",
")\n",
"display(properties)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "628c1ce2",
"metadata": {
"collapsed": false,
"is_executing": true,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# display clickable URLs\n",
"from IPython.display import display, HTML\n",
"properties['property_url'] = '<a href=\"' + properties['property_url'] + '\" target=\"_blank\">' + properties['property_url'] + '</a>'\n",
"\n",
"html = properties.to_html(escape=False)\n",
"truncate_width = f'<style>.dataframe td {{ max-width: 200px; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; }}</style>{html}'\n",
"display(HTML(truncate_width))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -1,20 +0,0 @@
from homeharvest import scrape_property
from datetime import datetime

# Generate filename based on current timestamp
current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"HomeHarvest_{current_timestamp}.csv"

properties = scrape_property(
    location="San Diego, CA",
    listing_type="sold",  # or (for_sale, for_rent)
    past_days=30,  # sold in last 30 days - listed in last x days if (for_sale, for_rent)
    # pending_or_contingent=True # use on for_sale listings to find pending / contingent listings
    # mls_only=True,  # only fetch MLS listings
    # proxy="http://user:pass@host:port"  # use a proxy to change your IP address
)
print(f"Number of properties: {len(properties)}")

# Export to csv
properties.to_csv(filename, index=False)
print(properties.head())

examples/price_of_land.py (new file, 104 lines)

@@ -0,0 +1,104 @@
"""
This script scrapes sold and pending sold land listings in past year for a list of zip codes and saves the data to individual Excel files.
It adds two columns to the data: 'lot_acres' and 'ppa' (price per acre) for user to analyze average price of land in a zip code.
"""

import os
import pandas as pd
from homeharvest import scrape_property


def get_property_details(zip: str, listing_type):
    properties = scrape_property(location=zip, listing_type=listing_type, property_type=["land"], past_days=365)
    if not properties.empty:
        properties["lot_acres"] = properties["lot_sqft"].apply(lambda x: x / 43560 if pd.notnull(x) else None)
        properties = properties[properties["sqft"].isnull()]
        properties["ppa"] = properties.apply(
            lambda row: (
                int(
                    (
                        row["sold_price"]
                        if (pd.notnull(row["sold_price"]) and row["status"] == "SOLD")
                        else row["list_price"]
                    )
                    / row["lot_acres"]
                )
                if pd.notnull(row["lot_acres"])
                and row["lot_acres"] > 0
                and (pd.notnull(row["sold_price"]) or pd.notnull(row["list_price"]))
                else None
            ),
            axis=1,
        )
        properties["ppa"] = properties["ppa"].astype("Int64")
        selected_columns = [
            "property_url",
            "property_id",
            "style",
            "status",
            "street",
            "city",
            "state",
            "zip_code",
            "county",
            "list_date",
            "last_sold_date",
            "list_price",
            "sold_price",
            "lot_sqft",
            "lot_acres",
            "ppa",
        ]
        properties = properties[selected_columns]
    return properties


def output_to_excel(zip_code, sold_df, pending_df):
    root_folder = os.getcwd()
    zip_folder = os.path.join(root_folder, "zips", zip_code)

    # Create zip code folder if it doesn't exist
    os.makedirs(zip_folder, exist_ok=True)

    # Define file paths
    sold_file = os.path.join(zip_folder, f"{zip_code}_sold.xlsx")
    pending_file = os.path.join(zip_folder, f"{zip_code}_pending.xlsx")

    # Save individual sold and pending files
    sold_df.to_excel(sold_file, index=False)
    pending_df.to_excel(pending_file, index=False)


zip_codes = map(
    str,
    [
        22920,
        77024,
        78028,
        24553,
        22967,
        22971,
        22922,
        22958,
        22969,
        22949,
        22938,
        24599,
        24562,
        22976,
        24464,
        22964,
        24581,
    ],
)

combined_df = pd.DataFrame()
for zip in zip_codes:
    sold_df = get_property_details(zip, "sold")
    pending_df = get_property_details(zip, "pending")
    combined_df = pd.concat([combined_df, sold_df, pending_df], ignore_index=True)
    output_to_excel(zip, sold_df, pending_df)

combined_file = os.path.join(os.getcwd(), "zips", "combined.xlsx")
combined_df.to_excel(combined_file, index=False)
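
The price-per-acre rule in the script above can be read more easily as a standalone function. This is a sketch under assumptions, not part of the commit: `price_per_acre` and `SQFT_PER_ACRE` are illustrative names, and unlike the `apply` lambda it returns `None` (rather than failing) when the chosen price is missing.

```python
SQFT_PER_ACRE = 43560  # same conversion constant the script uses


def price_per_acre(lot_sqft, sold_price=None, list_price=None, status=None):
    """Return whole-dollar price per acre, or None when it cannot be computed."""
    if lot_sqft is None or lot_sqft <= 0:
        return None
    # Prefer the sold price for SOLD listings; otherwise use the list price.
    if sold_price is not None and status == "SOLD":
        price = sold_price
    else:
        price = list_price
    if price is None:
        return None
    lot_acres = lot_sqft / SQFT_PER_ACRE
    # int() truncates toward zero, matching the script's int(...) call.
    return int(price / lot_acres)
```

For a two-acre lot (87,120 sq ft) that sold for 200,000, `price_per_acre(87120, sold_price=200000, status="SOLD")` gives 100,000 per acre.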

@@ -1,14 +1,16 @@
import warnings import warnings
import pandas as pd import pandas as pd
from .core.scrapers import ScraperInput from .core.scrapers import ScraperInput
from .utils import process_result, ordered_properties, validate_input, validate_dates from .utils import process_result, ordered_properties, validate_input, validate_dates, validate_limit
from .core.scrapers.realtor import RealtorScraper from .core.scrapers.realtor import RealtorScraper
from .core.scrapers.models import ListingType from .core.scrapers.models import ListingType, SearchPropertyType, ReturnType, Property
from typing import Union, Optional, List
def scrape_property( def scrape_property(
location: str, location: str,
listing_type: str = "for_sale", listing_type: str = "for_sale",
return_type: str = "pandas",
property_type: Optional[List[str]] = None,
radius: float = None, radius: float = None,
mls_only: bool = False, mls_only: bool = False,
past_days: int = None, past_days: int = None,
@@ -18,11 +20,14 @@ def scrape_property(
foreclosure: bool = None, foreclosure: bool = None,
extra_property_data: bool = True, extra_property_data: bool = True,
exclude_pending: bool = False, exclude_pending: bool = False,
) -> pd.DataFrame: limit: int = 10000
) -> Union[pd.DataFrame, list[dict], list[Property]]:
""" """
Scrape properties from Realtor.com based on a given location and listing type. Scrape properties from Realtor.com based on a given location and listing type.
:param location: Location to search (e.g. "Dallas, TX", "85281", "2530 Al Lipscomb Way") :param location: Location to search (e.g. "Dallas, TX", "85281", "2530 Al Lipscomb Way")
:param listing_type: Listing Type (for_sale, for_rent, sold, pending) :param listing_type: Listing Type (for_sale, for_rent, sold, pending)
:param return_type: Return type (pandas, pydantic, raw)
:param property_type: Property Type (single_family, multi_family, condos, condo_townhome_rowhome_coop, condo_townhome, townhomes, duplex_triplex, farm, land, mobile)
:param radius: Get properties within _ (e.g. 1.0) miles. Only applicable for individual addresses. :param radius: Get properties within _ (e.g. 1.0) miles. Only applicable for individual addresses.
:param mls_only: If set, fetches only listings with MLS IDs. :param mls_only: If set, fetches only listings with MLS IDs.
:param proxy: Proxy to use for scraping :param proxy: Proxy to use for scraping
@@ -31,13 +36,17 @@ def scrape_property(
:param foreclosure: If set, fetches only foreclosure listings. :param foreclosure: If set, fetches only foreclosure listings.
:param extra_property_data: Increases requests by O(n). If set, this fetches additional property data (e.g. agent, broker, property evaluations etc.) :param extra_property_data: Increases requests by O(n). If set, this fetches additional property data (e.g. agent, broker, property evaluations etc.)
:param exclude_pending: If true, this excludes pending or contingent properties from the results, unless listing type is pending. :param exclude_pending: If true, this excludes pending or contingent properties from the results, unless listing type is pending.
:param limit: Limit the number of results returned. Maximum is 10,000.
""" """
validate_input(listing_type) validate_input(listing_type)
validate_dates(date_from, date_to) validate_dates(date_from, date_to)
validate_limit(limit)
scraper_input = ScraperInput( scraper_input = ScraperInput(
location=location, location=location,
listing_type=ListingType[listing_type.upper()], listing_type=ListingType(listing_type.upper()),
return_type=ReturnType(return_type.lower()),
property_type=[SearchPropertyType[prop.upper()] for prop in property_type] if property_type else None,
proxy=proxy, proxy=proxy,
radius=radius, radius=radius,
mls_only=mls_only, mls_only=mls_only,
@@ -47,11 +56,15 @@ def scrape_property(
foreclosure=foreclosure, foreclosure=foreclosure,
extra_property_data=extra_property_data, extra_property_data=extra_property_data,
exclude_pending=exclude_pending, exclude_pending=exclude_pending,
limit=limit,
) )
site = RealtorScraper(scraper_input) site = RealtorScraper(scraper_input)
results = site.search() results = site.search()
if scraper_input.return_type != ReturnType.pandas:
return results
properties_dfs = [df for result in results if not (df := process_result(result)).empty] properties_dfs = [df for result in results if not (df := process_result(result)).empty]
if not properties_dfs: if not properties_dfs:
return pd.DataFrame() return pd.DataFrame()
@@ -59,4 +72,6 @@ def scrape_property(
with warnings.catch_warnings(): with warnings.catch_warnings():
warnings.simplefilter("ignore", category=FutureWarning) warnings.simplefilter("ignore", category=FutureWarning)
return pd.concat(properties_dfs, ignore_index=True, axis=0)[ordered_properties].replace({"None": pd.NA, None: pd.NA, "": pd.NA}) return pd.concat(properties_dfs, ignore_index=True, axis=0)[ordered_properties].replace(
{"None": pd.NA, None: pd.NA, "": pd.NA}
)
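
This diff mixes two kinds of `Enum` lookup: `ListingType(...)` and `ReturnType(...)` look a member up by value, while `SearchPropertyType[...]` looks it up by name. The sketch below uses trimmed copies of the enums from `models.py` (only a few members each, so it is not the library's full definition) to show why the casing applied before each lookup matters.

```python
from enum import Enum


class ListingType(Enum):  # trimmed copy: names and values are both upper case
    FOR_SALE = "FOR_SALE"
    SOLD = "SOLD"


class ReturnType(Enum):  # trimmed copy: names and values are both lower case
    pandas = "pandas"
    raw = "raw"


class SearchPropertyType(Enum):  # trimmed copy: upper-case names, lower-case values
    SINGLE_FAMILY = "single_family"
    LAND = "land"


# Call syntax looks up by value.
by_value = ListingType("sold".upper())
return_type = ReturnType("RAW".lower())

# Subscript syntax looks up by member name.
by_name = SearchPropertyType["land".upper()]

# A failed lookup raises a different error depending on the syntax used.
try:
    SearchPropertyType("LAND")  # "LAND" is a name, not a value
except ValueError as exc:
    value_error = exc

try:
    SearchPropertyType["castle".upper()]  # no member with this name
except KeyError as exc:
    key_error = exc
```

For `ListingType` the two forms are interchangeable because each member's name equals its value. For `SearchPropertyType` they are not, so an unknown `property_type` string passed through the subscript form surfaces as a `KeyError`.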

@@ -1,18 +1,20 @@
from __future__ import annotations from __future__ import annotations
from dataclasses import dataclass from typing import Union
import requests import requests
from requests.adapters import HTTPAdapter from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry from urllib3.util.retry import Retry
import uuid import uuid
from ...exceptions import AuthenticationError from ...exceptions import AuthenticationError
from .models import Property, ListingType, SiteName from .models import Property, ListingType, SiteName, SearchPropertyType, ReturnType
import json import json
from pydantic import BaseModel
@dataclass class ScraperInput(BaseModel):
class ScraperInput:
location: str location: str
listing_type: ListingType listing_type: ListingType
property_type: list[SearchPropertyType] | None = None
radius: float | None = None radius: float | None = None
mls_only: bool | None = False mls_only: bool | None = False
proxy: str | None = None proxy: str | None = None
@@ -22,6 +24,8 @@ class ScraperInput:
foreclosure: bool | None = False foreclosure: bool | None = False
extra_property_data: bool | None = True extra_property_data: bool | None = True
exclude_pending: bool | None = False exclude_pending: bool | None = False
limit: int = 10000
return_type: ReturnType = ReturnType.pandas
class Scraper: class Scraper:
@@ -33,11 +37,12 @@ class Scraper:
): ):
self.location = scraper_input.location self.location = scraper_input.location
self.listing_type = scraper_input.listing_type self.listing_type = scraper_input.listing_type
self.property_type = scraper_input.property_type
if not self.session: if not self.session:
Scraper.session = requests.Session() Scraper.session = requests.Session()
retries = Retry( retries = Retry(
total=3, backoff_factor=3, status_forcelist=[429, 403], allowed_methods=frozenset(["GET", "POST"]) total=3, backoff_factor=4, status_forcelist=[429, 403], allowed_methods=frozenset(["GET", "POST"])
) )
adapter = HTTPAdapter(max_retries=retries) adapter = HTTPAdapter(max_retries=retries)
@@ -45,8 +50,21 @@ class Scraper:
Scraper.session.mount("https://", adapter) Scraper.session.mount("https://", adapter)
Scraper.session.headers.update( Scraper.session.headers.update(
{ {
"auth": f"Bearer {self.get_access_token()}", "accept": "application/json, text/javascript",
"apollographql-client-name": "com.move.Realtor-apollo-ios", "accept-language": "en-US,en;q=0.9",
"cache-control": "no-cache",
"content-type": "application/json",
"origin": "https://www.realtor.com",
"pragma": "no-cache",
"priority": "u=1, i",
"rdc-ab-tests": "commute_travel_time_variation:v1",
"sec-ch-ua": '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
} }
) )
@@ -64,8 +82,10 @@ class Scraper:
self.foreclosure = scraper_input.foreclosure self.foreclosure = scraper_input.foreclosure
self.extra_property_data = scraper_input.extra_property_data self.extra_property_data = scraper_input.extra_property_data
self.exclude_pending = scraper_input.exclude_pending self.exclude_pending = scraper_input.exclude_pending
self.limit = scraper_input.limit
self.return_type = scraper_input.return_type
def search(self) -> list[Property]: ... def search(self) -> list[Union[Property | dict]]: ...
@staticmethod @staticmethod
def _parse_home(home) -> Property: ... def _parse_home(home) -> Property: ...
@@ -79,27 +99,29 @@ class Scraper:
response = requests.post( response = requests.post(
"https://graph.realtor.com/auth/token", "https://graph.realtor.com/auth/token",
headers={ headers={
'Host': 'graph.realtor.com', "Host": "graph.realtor.com",
'Accept': '*/*', "Accept": "*/*",
'Content-Type': 'Application/json', "Content-Type": "Application/json",
'X-Client-ID': 'rdc_mobile_native,iphone', "X-Client-ID": "rdc_mobile_native,iphone",
'X-Visitor-ID': device_id, "X-Visitor-ID": device_id,
'X-Client-Version': '24.21.23.679885', "X-Client-Version": "24.21.23.679885",
'Accept-Language': 'en-US,en;q=0.9', "Accept-Language": "en-US,en;q=0.9",
'User-Agent': 'Realtor.com/24.21.23.679885 CFNetwork/1494.0.7 Darwin/23.4.0', "User-Agent": "Realtor.com/24.21.23.679885 CFNetwork/1494.0.7 Darwin/23.4.0",
}, },
data=json.dumps({ data=json.dumps(
"grant_type": "device_mobile", {
"device_id": device_id, "grant_type": "device_mobile",
"client_app_id": "rdc_mobile_native,24.21.23.679885,iphone" "device_id": device_id,
})) "client_app_id": "rdc_mobile_native,24.21.23.679885,iphone",
}
),
)
data = response.json() data = response.json()
if not (access_token := data.get("access_token")): if not (access_token := data.get("access_token")):
raise AuthenticationError( raise AuthenticationError(
"Failed to get access token, use a proxy/vpn or wait a moment and try again.", "Failed to get access token, use a proxy/vpn or wait a moment and try again.", response=response
response=response
) )
return access_token return access_token

@@ -1,7 +1,14 @@
from __future__ import annotations from __future__ import annotations
from dataclasses import dataclass
from enum import Enum from enum import Enum
from typing import Optional from typing import Optional, Any
from datetime import datetime
from pydantic import BaseModel, computed_field, HttpUrl, Field
class ReturnType(Enum):
pydantic = "pydantic"
pandas = "pandas"
raw = "raw"
class SiteName(Enum): class SiteName(Enum):
@@ -17,6 +24,20 @@ class SiteName(Enum):
raise ValueError(f"{value} not found in {cls}") raise ValueError(f"{value} not found in {cls}")
class SearchPropertyType(Enum):
SINGLE_FAMILY = "single_family"
APARTMENT = "apartment"
CONDOS = "condos"
CONDO_TOWNHOME_ROWHOME_COOP = "condo_townhome_rowhome_coop"
CONDO_TOWNHOME = "condo_townhome"
TOWNHOMES = "townhomes"
DUPLEX_TRIPLEX = "duplex_triplex"
FARM = "farm"
LAND = "land"
MULTI_FAMILY = "multi_family"
MOBILE = "mobile"
class ListingType(Enum): class ListingType(Enum):
FOR_SALE = "FOR_SALE" FOR_SALE = "FOR_SALE"
FOR_RENT = "FOR_RENT" FOR_RENT = "FOR_RENT"
@@ -24,12 +45,6 @@ class ListingType(Enum):
SOLD = "SOLD" SOLD = "SOLD"
@dataclass
class Agent:
name: str | None = None
phone: str | None = None
class PropertyType(Enum): class PropertyType(Enum):
APARTMENT = "APARTMENT" APARTMENT = "APARTMENT"
BUILDING = "BUILDING" BUILDING = "BUILDING"
@@ -54,80 +69,299 @@ class PropertyType(Enum):
OTHER = "OTHER" OTHER = "OTHER"
@dataclass class Address(BaseModel):
class Address:
full_line: str | None = None full_line: str | None = None
street: str | None = None street: str | None = None
unit: str | None = None unit: str | None = None
city: str | None = None city: str | None = Field(None, description="The name of the city")
state: str | None = None state: str | None = Field(None, description="The name of the state")
zip: str | None = None zip: str | None = Field(None, description="zip code")
# Additional address fields from GraphQL
street_direction: str | None = None
street_number: str | None = None
street_name: str | None = None
street_suffix: str | None = None
@computed_field
@property
def formatted_address(self) -> str | None:
"""Computed property that combines full_line, city, state, and zip into a formatted address."""
parts = []
if self.full_line:
parts.append(self.full_line)
city_state_zip = []
if self.city:
city_state_zip.append(self.city)
if self.state:
city_state_zip.append(self.state)
if self.zip:
city_state_zip.append(self.zip)
if city_state_zip:
parts.append(", ".join(city_state_zip))
return ", ".join(parts) if parts else None
@dataclass
class Description:
primary_photo: str | None = None class Description(BaseModel):
alt_photos: list[str] | None = None primary_photo: HttpUrl | None = None
alt_photos: list[HttpUrl] | None = None
style: PropertyType | None = None style: PropertyType | None = None
beds: int | None = None beds: int | None = Field(None, description="Total number of bedrooms")
baths_full: int | None = None baths_full: int | None = Field(None, description="Total number of full bathrooms (4 parts: Sink, Shower, Bathtub and Toilet)")
baths_half: int | None = None baths_half: int | None = Field(None, description="Total number of 1/2 bathrooms (2 parts: Usually Sink and Toilet)")
sqft: int | None = None sqft: int | None = Field(None, description="Square footage of the Home")
lot_sqft: int | None = None lot_sqft: int | None = Field(None, description="Lot square footage")
sold_price: int | None = None sold_price: int | None = Field(None, description="Sold price of home")
year_built: int | None = None year_built: int | None = Field(None, description="The year the building/home was built")
garage: float | None = None garage: float | None = Field(None, description="Number of garage spaces")
stories: int | None = None stories: int | None = Field(None, description="Number of stories in the building")
text: str | None = None text: str | None = None
# Additional description fields
name: str | None = None
type: str | None = None
@dataclass class AgentPhone(BaseModel):
class AgentPhone: #: For documentation purposes only (at the moment)
number: str | None = None number: str | None = None
type: str | None = None type: str | None = None
primary: bool | None = None primary: bool | None = None
ext: str | None = None ext: str | None = None
@dataclass class Entity(BaseModel):
class Agent: name: str | None = None # Make name optional since it can be None
name: str | None = None uuid: str | None = None
class Agent(Entity):
mls_set: str | None = None
nrds_id: str | None = None
phones: list[dict] | AgentPhone | None = None phones: list[dict] | AgentPhone | None = None
email: str | None = None email: str | None = None
href: str | None = None href: str | None = None
state_license: str | None = Field(None, description="Advertiser agent state license number")
@dataclass class Office(Entity):
class Broker: mls_set: str | None = None
name: str | None = None email: str | None = None
phone: str | None = None href: str | None = None
website: str | None = None phones: list[dict] | AgentPhone | None = None
@dataclass class Broker(Entity):
class Property: pass
property_url: str
class Builder(Entity):
pass
class Advertisers(BaseModel):
agent: Agent | None = None
broker: Broker | None = None
builder: Builder | None = None
office: Office | None = None
class Property(BaseModel):
property_url: HttpUrl
property_id: str = Field(..., description="Unique Home identifier also known as property id")
#: allows_cats: bool
#: allows_dogs: bool
listing_id: str | None = None
permalink: str | None = None
mls: str | None = None mls: str | None = None
mls_id: str | None = None mls_id: str | None = None
status: str | None = None status: str | None = Field(None, description="Listing status: for_sale, for_rent, sold, off_market, active (New Home Subdivisions), other (if none of the above conditions were met)")
address: Address | None = None address: Address | None = None
list_price: int | None = None list_price: int | None = Field(None, description="The current price of the Home")
list_date: str | None = None list_price_min: int | None = None
pending_date: str | None = None list_price_max: int | None = None
last_sold_date: str | None = None
list_date: datetime | None = Field(None, description="The time this Home entered Move system")
pending_date: datetime | None = Field(None, description="The date listing went into pending state")
last_sold_date: datetime | None = Field(None, description="Last time the Home was sold")
prc_sqft: int | None = None prc_sqft: int | None = None
hoa_fee: int | None = None new_construction: bool | None = Field(None, description="Search for new construction homes")
days_on_mls: int | None = None hoa_fee: int | None = Field(None, description="Search for homes where HOA fee is known and falls within specified range")
days_on_mls: int | None = Field(None, description="An integer value determined by the MLS to calculate days on market")
description: Description | None = None description: Description | None = None
tags: list[str] | None = None
details: list[HomeDetails] | None = None
latitude: float | None = None latitude: float | None = None
longitude: float | None = None longitude: float | None = None
neighborhoods: Optional[str] = None neighborhoods: Optional[str] = None
county: Optional[str] = None county: Optional[str] = Field(None, description="County associated with home")
fips_code: Optional[str] = None fips_code: Optional[str] = Field(None, description="The FIPS (Federal Information Processing Standard) code for the county")
agents: list[Agent] | None = None nearby_schools: list[str] | None = None
brokers: list[Broker] | None = None
nearby_schools: list[str] = None
assessed_value: int | None = None assessed_value: int | None = None
estimated_value: int | None = None estimated_value: int | None = None
tax: int | None = None
tax_history: list[TaxHistory] | None = None
advertisers: Advertisers | None = None
# Additional fields from GraphQL that aren't currently parsed
mls_status: str | None = None
last_sold_price: int | None = None
# Structured data from GraphQL
open_houses: list[OpenHouse] | None = None
pet_policy: PetPolicy | None = None
units: list[Unit] | None = None
monthly_fees: HomeMonthlyFee | None = Field(None, description="Monthly fees. Currently only some rental data will have them.")
one_time_fees: list[HomeOneTimeFee] | None = Field(None, description="One time fees. Currently only some rental data will have them.")
parking: HomeParkingDetails | None = Field(None, description="Parking information. Currently only some rental data will have it.")
terms: list[PropertyDetails] | None = None
popularity: Popularity | None = None
tax_record: TaxRecord | None = None
parcel_info: dict | None = None # Keep as dict for flexibility
current_estimates: list[PropertyEstimate] | None = None
estimates: HomeEstimates | None = None
photos: list[dict] | None = None # Keep as dict for photo structure
flags: HomeFlags | None = Field(None, description="Home flags for Listing/Property")
# Specialized models for GraphQL types
class HomeMonthlyFee(BaseModel):
description: str | None = None
display_amount: str | None = None
class HomeOneTimeFee(BaseModel):
description: str | None = None
display_amount: str | None = None
class HomeParkingDetails(BaseModel):
unassigned_space_rent: int | None = None
assigned_spaces_available: int | None = None
description: str | None = Field(None, description="Parking information. Currently only some rental data will have it.")
assigned_space_rent: int | None = None
class PetPolicy(BaseModel):
cats: bool | None = Field(None, description="Search for homes which allow cats")
dogs: bool | None = Field(None, description="Search for homes which allow dogs")
dogs_small: bool | None = Field(None, description="Search for homes which allow small dogs")
dogs_large: bool | None = Field(None, description="Search for homes which allow large dogs")
class OpenHouse(BaseModel):
start_date: datetime | None = None
end_date: datetime | None = None
description: str | None = None
time_zone: str | None = None
dst: bool | None = None
href: HttpUrl | None = None
methods: list[str] | None = None
class HomeFlags(BaseModel):
is_pending: bool | None = None
is_contingent: bool | None = None
is_new_construction: bool | None = None
is_coming_soon: bool | None = None
is_new_listing: bool | None = None
is_price_reduced: bool | None = None
is_foreclosure: bool | None = None
class PopularityPeriod(BaseModel):
clicks_total: int | None = None
views_total: int | None = None
dwell_time_mean: float | None = None
dwell_time_median: float | None = None
leads_total: int | None = None
shares_total: int | None = None
saves_total: int | None = None
last_n_days: int | None = None
class Popularity(BaseModel):
periods: list[PopularityPeriod] | None = None
class Assessment(BaseModel):
building: int | None = None
land: int | None = None
total: int | None = None
class TaxHistory(BaseModel):
assessment: Assessment | None = None
market: Assessment | None = Field(None, description="Market values as provided by the county or local taxing/assessment authority")
appraisal: Assessment | None = Field(None, description="Appraised value given by taxing authority")
value: Assessment | None = Field(None, description="Value closest to current market value used for assessment by county or local taxing authorities")
tax: int | None = None
year: int | None = None
assessed_year: int | None = Field(None, description="Assessment year for which taxes were billed")
class TaxRecord(BaseModel):
cl_id: str | None = None
public_record_id: str | None = None
last_update_date: datetime | None = None
apn: str | None = None
tax_parcel_id: str | None = None
class EstimateSource(BaseModel):
type: str | None = Field(None, description="Type of the avm vendor, list of values: corelogic, collateral, quantarium")
name: str | None = Field(None, description="Name of the avm vendor")
class PropertyEstimate(BaseModel):
estimate: int | None = Field(None, description="Estimated value of a property")
estimate_high: int | None = Field(None, description="Estimated high value of a property")
estimate_low: int | None = Field(None, description="Estimated low value of a property")
date: datetime | None = Field(None, description="Date of estimation")
is_best_home_value: bool | None = None
source: EstimateSource | None = Field(None, description="Source of the latest estimate value")
class HomeEstimates(BaseModel):
current_values: list[PropertyEstimate] | None = Field(None, description="Current valuation and best value for home from multiple AVM vendors")
class PropertyDetails(BaseModel):
category: str | None = None
text: list[str] | None = None
parent_category: str | None = None
class HomeDetails(BaseModel):
category: str | None = None
text: list[str] | None = None
parent_category: str | None = None
class UnitDescription(BaseModel):
baths_consolidated: str | None = None
baths: float | None = None # Changed to float to handle values like 2.5
beds: int | None = None
sqft: int | None = None
class UnitAvailability(BaseModel):
date: datetime | None = None
class Unit(BaseModel):
availability: UnitAvailability | None = None
description: UnitDescription | None = None
photos: list[dict] | None = None # Keep as dict for photo structure
list_price: int | None = None

View File

@@ -6,12 +6,32 @@ This module implements the scraper for realtor.com
""" """
from __future__ import annotations from __future__ import annotations
import json
from concurrent.futures import ThreadPoolExecutor, as_completed from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime from datetime import datetime
from typing import Dict, Union, Optional from json import JSONDecodeError
from typing import Dict, Union
from tenacity import (
retry,
retry_if_exception_type,
wait_exponential,
stop_after_attempt,
)
from .. import Scraper from .. import Scraper
from ..models import Property, Address, ListingType, Description, PropertyType, Agent, Broker from ..models import (
Property,
ListingType,
ReturnType
)
from .queries import GENERAL_RESULTS_QUERY, SEARCH_HOMES_DATA, HOMES_DATA, HOME_FRAGMENT
from .processors import (
process_property,
process_extra_property_details,
get_key
)
class RealtorScraper(Scraper): class RealtorScraper(Scraper):
@@ -20,6 +40,7 @@ class RealtorScraper(Scraper):
PROPERTY_GQL = "https://graph.realtor.com/graphql" PROPERTY_GQL = "https://graph.realtor.com/graphql"
ADDRESS_AUTOCOMPLETE_URL = "https://parser-external.geo.moveaws.com/suggest" ADDRESS_AUTOCOMPLETE_URL = "https://parser-external.geo.moveaws.com/suggest"
NUM_PROPERTY_WORKERS = 20 NUM_PROPERTY_WORKERS = 20
DEFAULT_PAGE_SIZE = 200
def __init__(self, scraper_input): def __init__(self, scraper_input):
super().__init__(scraper_input) super().__init__(scraper_input)
@@ -45,156 +66,6 @@ class RealtorScraper(Scraper):
return result[0] return result[0]
def handle_listing(self, listing_id: str) -> list[Property]:
query = """query Listing($listing_id: ID!) {
listing(id: $listing_id) {
source {
id
listing_id
}
address {
line
street_direction
street_number
street_name
street_suffix
unit
city
state_code
postal_code
location {
coordinate {
lat
lon
}
}
}
basic {
sqft
beds
baths_full
baths_half
lot_sqft
sold_price
sold_price
type
price
status
sold_date
list_date
}
details {
year_built
stories
garage
permalink
}
media {
photos {
href
}
}
}
}"""
variables = {"listing_id": listing_id}
payload = {
"query": query,
"variables": variables,
}
response = self.session.post(self.SEARCH_GQL_URL, json=payload)
response_json = response.json()
property_info = response_json["data"]["listing"]
mls = (
property_info["source"].get("id")
if "source" in property_info and isinstance(property_info["source"], dict)
else None
)
able_to_get_lat_long = (
property_info
and property_info.get("address")
and property_info["address"].get("location")
and property_info["address"]["location"].get("coordinate")
)
list_date_str = (
property_info["basic"]["list_date"].split("T")[0] if property_info["basic"].get("list_date") else None
)
last_sold_date_str = (
property_info["basic"]["sold_date"].split("T")[0] if property_info["basic"].get("sold_date") else None
)
pending_date_str = property_info["pending_date"].split("T")[0] if property_info.get("pending_date") else None
list_date = datetime.strptime(list_date_str, "%Y-%m-%d") if list_date_str else None
last_sold_date = datetime.strptime(last_sold_date_str, "%Y-%m-%d") if last_sold_date_str else None
pending_date = datetime.strptime(pending_date_str, "%Y-%m-%d") if pending_date_str else None
today = datetime.now()
days_on_mls = None
status = property_info["basic"]["status"].lower()
if list_date:
if status == "sold" and last_sold_date:
days_on_mls = (last_sold_date - list_date).days
elif status in ("for_sale", "for_rent"):
days_on_mls = (today - list_date).days
if days_on_mls and days_on_mls < 0:
days_on_mls = None
property_id = property_info["details"]["permalink"]
prop_details = self.get_prop_details(property_id)
style = property_info["basic"].get("type", "").upper()
listing = Property(
mls=mls,
mls_id=(
property_info["source"].get("listing_id")
if "source" in property_info and isinstance(property_info["source"], dict)
else None
),
property_url=f"{self.PROPERTY_URL}{property_id}",
status=property_info["basic"]["status"].upper(),
list_price=property_info["basic"]["price"],
list_date=list_date,
prc_sqft=(
property_info["basic"].get("price") / property_info["basic"].get("sqft")
if property_info["basic"].get("price") and property_info["basic"].get("sqft")
else None
),
last_sold_date=last_sold_date,
pending_date=pending_date,
latitude=property_info["address"]["location"]["coordinate"].get("lat") if able_to_get_lat_long else None,
longitude=property_info["address"]["location"]["coordinate"].get("lon") if able_to_get_lat_long else None,
address=self._parse_address(property_info, search_type="handle_listing"),
description=Description(
alt_photos=(
self.process_alt_photos(property_info["media"].get("photos", []))
if property_info.get("media")
else None
),
style=PropertyType.__getitem__(style) if style and style in PropertyType.__members__ else None,
beds=property_info["basic"].get("beds"),
baths_full=property_info["basic"].get("baths_full"),
baths_half=property_info["basic"].get("baths_half"),
sqft=property_info["basic"].get("sqft"),
lot_sqft=property_info["basic"].get("lot_sqft"),
sold_price=property_info["basic"].get("sold_price"),
year_built=property_info["details"].get("year_built"),
garage=property_info["details"].get("garage"),
stories=property_info["details"].get("stories"),
text=property_info.get("description", {}).get("text"),
),
days_on_mls=days_on_mls,
agents=prop_details.get("agents"),
brokers=prop_details.get("brokers"),
nearby_schools=prop_details.get("schools"),
assessed_value=prop_details.get("assessed_value"),
estimated_value=prop_details.get("estimated_value"),
)
return [listing]
def get_latest_listing_id(self, property_id: str) -> str | None: def get_latest_listing_id(self, property_id: str) -> str | None:
query = """query Property($property_id: ID!) { query = """query Property($property_id: ID!) {
property(id: $property_id) { property(id: $property_id) {
@@ -228,65 +99,15 @@ class RealtorScraper(Scraper):
else: else:
return property_info["listings"][0]["listing_id"] return property_info["listings"][0]["listing_id"]
def handle_address(self, property_id: str) -> list[Property]: def handle_home(self, property_id: str) -> list[Property]:
""" query = (
Handles a specific address & returns one property """query Home($property_id: ID!) {
""" home(property_id: $property_id) %s
query = """query Property($property_id: ID!) {
property(id: $property_id) {
property_id
details {
date_updated
garage
permalink
year_built
stories
}
address {
line
street_direction
street_number
street_name
street_suffix
unit
city
state_code
postal_code
location {
coordinate {
lat
lon
}
}
}
basic {
baths
beds
price
sqft
lot_sqft
type
sold_price
}
public_record {
lot_size
sqft
stories
units
year_built
}
primary_photo {
href
}
photos {
href
}
}
}""" }"""
% HOMES_DATA
)
variables = {"property_id": property_id} variables = {"property_id": property_id}
prop_details = self.get_prop_details(property_id)
payload = { payload = {
"query": query, "query": query,
"variables": variables, "variables": variables,
@@ -295,101 +116,20 @@ class RealtorScraper(Scraper):
response = self.session.post(self.SEARCH_GQL_URL, json=payload) response = self.session.post(self.SEARCH_GQL_URL, json=payload)
response_json = response.json() response_json = response.json()
property_info = response_json["data"]["property"] property_info = response_json["data"]["home"]
return [ if self.return_type != ReturnType.raw:
Property( return [process_property(property_info, self.mls_only, self.extra_property_data,
mls_id=property_id, self.exclude_pending, self.listing_type, get_key, process_extra_property_details)]
property_url=f"{self.PROPERTY_URL}{property_info['details']['permalink']}", else:
address=self._parse_address(property_info, search_type="handle_address"), return [property_info]
description=self._parse_description(property_info),
agents=prop_details.get("agents"),
brokers=prop_details.get("brokers"),
nearby_schools=prop_details.get("schools"),
assessed_value=prop_details.get("assessed_value"),
estimated_value=prop_details.get("estimated_value"),
)
]
def general_search(self, variables: dict, search_type: str) -> Dict[str, Union[int, list[Property]]]:
def general_search(self, variables: dict, search_type: str) -> Dict[str, Union[int, Union[list[Property], list[dict]]]]:
""" """
Handles a location area & returns a list of properties Handles a location area & returns a list of properties
""" """
results_query = """{
count
total
results {
pending_date
property_id
list_date
status
last_sold_price
last_sold_date
list_price
price_per_sqft
flags {
is_contingent
is_pending
}
description {
type
sqft
beds
baths_full
baths_half
lot_sqft
sold_price
year_built
garage
sold_price
type
name
stories
text
}
source {
id
listing_id
}
hoa {
fee
}
location {
address {
street_direction
street_number
street_name
street_suffix
line
unit
city
state_code
postal_code
coordinate {
lon
lat
}
}
county {
name
fips_code
}
neighborhoods {
name
}
}
tax_record {
public_record_id
}
primary_photo {
href
}
photos {
href
}
}
}
}"""
date_param = "" date_param = ""
if self.listing_type == ListingType.SOLD: if self.listing_type == ListingType.SOLD:
@@ -403,10 +143,15 @@ class RealtorScraper(Scraper):
elif self.last_x_days: elif self.last_x_days:
date_param = f'list_date: {{ min: "$today-{self.last_x_days}D" }}' date_param = f'list_date: {{ min: "$today-{self.last_x_days}D" }}'
property_type_param = ""
if self.property_type:
property_types = [pt.value for pt in self.property_type]
property_type_param = f"type: {json.dumps(property_types)}"
sort_param = ( sort_param = (
"sort: [{ field: sold_date, direction: desc }]" "sort: [{ field: sold_date, direction: desc }]"
if self.listing_type == ListingType.SOLD if self.listing_type == ListingType.SOLD
else "sort: [{ field: list_date, direction: desc }]" else "" #: "sort: [{ field: list_date, direction: desc }]" #: prioritize normal fractal sort from realtor
) )
pending_or_contingent_param = ( pending_or_contingent_param = (
@@ -437,17 +182,20 @@ class RealtorScraper(Scraper):
status: %s status: %s
%s %s
%s %s
%s
} }
%s %s
limit: 200 limit: 200
offset: $offset offset: $offset
) %s""" % ( ) %s
}""" % (
is_foreclosure, is_foreclosure,
listing_type.value.lower(), listing_type.value.lower(),
date_param, date_param,
property_type_param,
pending_or_contingent_param, pending_or_contingent_param,
sort_param, sort_param,
results_query, GENERAL_RESULTS_QUERY,
) )
elif search_type == "area": #: general search, came from a general location elif search_type == "area": #: general search, came from a general location
query = """query Home_search( query = """query Home_search(
@@ -467,17 +215,21 @@ class RealtorScraper(Scraper):
status: %s status: %s
%s %s
%s %s
%s
} }
bucket: { sort: "fractal_v1.1.3_fr" }
%s %s
limit: 200 limit: 200
offset: $offset offset: $offset
) %s""" % ( ) %s
}""" % (
is_foreclosure, is_foreclosure,
listing_type.value.lower(), listing_type.value.lower(),
date_param, date_param,
property_type_param,
pending_or_contingent_param, pending_or_contingent_param,
sort_param, sort_param,
results_query, GENERAL_RESULTS_QUERY,
) )
else: #: general search, came from an address else: #: general search, came from an address
query = ( query = (
@@ -485,14 +237,15 @@ class RealtorScraper(Scraper):
$property_id: [ID]! $property_id: [ID]!
$offset: Int!, $offset: Int!,
) { ) {
property_search( home_search(
query: { query: {
property_id: $property_id property_id: $property_id
} }
limit: 1 limit: 1
offset: $offset offset: $offset
) %s""" ) %s
% results_query }"""
% GENERAL_RESULTS_QUERY
) )
payload = { payload = {
@@ -504,7 +257,7 @@ class RealtorScraper(Scraper):
response_json = response.json() response_json = response.json()
search_key = "home_search" if "home_search" in query else "property_search" search_key = "home_search" if "home_search" in query else "property_search"
properties: list[Property] = [] properties: list[Union[Property, dict]] = []
if ( if (
response_json is None response_json is None
@@ -516,73 +269,43 @@ class RealtorScraper(Scraper):
): ):
return {"total": 0, "properties": []} return {"total": 0, "properties": []}
def process_property(result: dict) -> Property | None: properties_list = response_json["data"][search_key]["results"]
mls = result["source"].get("id") if "source" in result and isinstance(result["source"], dict) else None total_properties = response_json["data"][search_key]["total"]
offset = variables.get("offset", 0)
if not mls and self.mls_only: #: limit the number of properties to be processed
return #: example, if your offset is 200, and your limit is 250, return 50
properties_list: list[dict] = properties_list[: self.limit - offset]
able_to_get_lat_long = ( if self.extra_property_data:
result property_ids = [data["property_id"] for data in properties_list]
and result.get("location") extra_property_details = self.get_bulk_prop_details(property_ids) or {}
and result["location"].get("address")
and result["location"]["address"].get("coordinate")
)
is_pending = result["flags"].get("is_pending") or result["flags"].get("is_contingent") for result in properties_list:
specific_details_for_property = extra_property_details.get(result["property_id"], {})
if is_pending and (self.exclude_pending and self.listing_type != ListingType.PENDING): #: address is retrieved on both homes and search homes, so when merged, homes overrides,
return # this gets the internal data we want and only updates that (migrate to a func if more fields)
if "location" in specific_details_for_property:
result["location"].update(specific_details_for_property["location"])
del specific_details_for_property["location"]
property_id = result["property_id"] result.update(specific_details_for_property)
prop_details = self.get_prop_details(property_id) if self.extra_property_data else {}
realty_property = Property( if self.return_type != ReturnType.raw:
mls=mls, with ThreadPoolExecutor(max_workers=self.NUM_PROPERTY_WORKERS) as executor:
mls_id=( futures = [executor.submit(process_property, result, self.mls_only, self.extra_property_data,
result["source"].get("listing_id") self.exclude_pending, self.listing_type, get_key, process_extra_property_details) for result in properties_list]
if "source" in result and isinstance(result["source"], dict)
else None
),
property_url=(
f"{self.PROPERTY_URL}{property_id}"
if self.listing_type != ListingType.FOR_RENT
else f"{self.PROPERTY_URL}M{property_id}?listing_status=rental"
),
status="PENDING" if is_pending else result["status"].upper(),
list_price=result["list_price"],
list_date=result["list_date"].split("T")[0] if result.get("list_date") else None,
prc_sqft=result.get("price_per_sqft"),
last_sold_date=result.get("last_sold_date"),
hoa_fee=result["hoa"]["fee"] if result.get("hoa") and isinstance(result["hoa"], dict) else None,
latitude=result["location"]["address"]["coordinate"].get("lat") if able_to_get_lat_long else None,
longitude=result["location"]["address"]["coordinate"].get("lon") if able_to_get_lat_long else None,
address=self._parse_address(result, search_type="general_search"),
description=self._parse_description(result),
neighborhoods=self._parse_neighborhoods(result),
county=result["location"]["county"].get("name") if result["location"]["county"] else None,
fips_code=result["location"]["county"].get("fips_code") if result["location"]["county"] else None,
days_on_mls=self.calculate_days_on_mls(result),
agents=prop_details.get("agents"),
brokers=prop_details.get("brokers"),
nearby_schools=prop_details.get("schools"),
assessed_value=prop_details.get("assessed_value"),
estimated_value=prop_details.get("estimated_value"),
)
return realty_property
with ThreadPoolExecutor(max_workers=self.NUM_PROPERTY_WORKERS) as executor: for future in as_completed(futures):
futures = [ result = future.result()
executor.submit(process_property, result) for result in response_json["data"][search_key]["results"] if result:
] properties.append(result)
else:
for future in as_completed(futures): properties = properties_list
result = future.result()
if result:
properties.append(result)
return { return {
"total": response_json["data"][search_key]["total"], "total": total_properties,
"properties": properties, "properties": properties,
} }
@@ -605,17 +328,7 @@ class RealtorScraper(Scraper):
if location_type == "address": if location_type == "address":
if not self.radius: #: single address search, non comps if not self.radius: #: single address search, non comps
property_id = location_info["mpr_id"] property_id = location_info["mpr_id"]
search_variables |= {"property_id": property_id} return self.handle_home(property_id)
gql_results = self.general_search(search_variables, search_type=search_type)
if gql_results["total"] == 0:
listing_id = self.get_latest_listing_id(property_id)
if listing_id is None:
return self.handle_address(property_id)
else:
return self.handle_listing(listing_id)
else:
return gql_results["properties"]
else: #: general search, comps (radius) else: #: general search, comps (radius)
if not location_info.get("centroid"): if not location_info.get("centroid"):
@@ -638,6 +351,7 @@ class RealtorScraper(Scraper):
"county": location_info.get("county"), "county": location_info.get("county"),
"state_code": location_info.get("state_code"), "state_code": location_info.get("state_code"),
"postal_code": location_info.get("postal_code"), "postal_code": location_info.get("postal_code"),
} }
if self.foreclosure: if self.foreclosure:
@@ -654,7 +368,11 @@ class RealtorScraper(Scraper):
variables=search_variables | {"offset": i}, variables=search_variables | {"offset": i},
search_type=search_type, search_type=search_type,
) )
for i in range(200, min(total, 10000), 200) for i in range(
self.DEFAULT_PAGE_SIZE,
min(total, self.limit),
self.DEFAULT_PAGE_SIZE,
)
] ]
for future in as_completed(futures): for future in as_completed(futures):
@@ -662,199 +380,41 @@ class RealtorScraper(Scraper):
return homes return homes
def get_prop_details(self, property_id: str) -> dict:
if not self.extra_property_data:
@retry(
retry=retry_if_exception_type(JSONDecodeError),
wait=wait_exponential(min=4, max=10),
stop=stop_after_attempt(3),
)
def get_bulk_prop_details(self, property_ids: list[str]) -> dict:
"""
Fetch extra property details for multiple properties in a single GraphQL query.
Returns a map of property_id to its details.
"""
if not self.extra_property_data or not property_ids:
return {} return {}
#: TODO: migrate "advertisers" and "estimates" to general query property_ids = list(set(property_ids))
query = """query GetHome($property_id: ID!) { # Construct the bulk query
home(property_id: $property_id) { fragments = "\n".join(
__typename f'home_{property_id}: home(property_id: {property_id}) {{ ...HomeData }}'
for property_id in property_ids
)
query = f"""{HOME_FRAGMENT}
query GetHomes {{
{fragments}
}}"""
advertisers { response = self.session.post(self.SEARCH_GQL_URL, json={"query": query})
__typename
type
name
email
phones { number type ext primary }
}
consumer_advertisers {
name
phone
href
type
}
nearbySchools: nearby_schools(radius: 5.0, limit_per_level: 3) {
__typename schools { district { __typename id name } }
}
taxHistory: tax_history { __typename tax year assessment { __typename building land total } }
estimates {
__typename
currentValues: current_values {
__typename
source { __typename type name }
estimate
estimateHigh: estimate_high
estimateLow: estimate_low
date
isBestHomeValue: isbest_homevalue
}
}
}
}"""
variables = {"property_id": property_id}
response = self.session.post(self.PROPERTY_GQL, json={"query": query, "variables": variables})
data = response.json() data = response.json()
def get_key(keys: list): if "data" not in data:
try: return {}
value = data
for key in keys:
value = value[key]
return value or {} properties = data["data"]
except (KeyError, TypeError, IndexError): return {data.replace('home_', ''): properties[data] for data in properties if properties[data]}
return {}
agents = get_key(["data", "home", "advertisers"])
advertisers = get_key(["data", "home", "consumer_advertisers"])
schools = get_key(["data", "home", "nearbySchools", "schools"])
assessed_value = get_key(["data", "home", "taxHistory", 0, "assessment", "total"])
estimated_value = get_key(["data", "home", "estimates", "currentValues", 0, "estimate"])
agents = [Agent(name=ad["name"], email=ad["email"], phones=ad["phones"]) for ad in agents]
brokers = [
Broker(name=ad["name"], phone=ad["phone"], website=ad["href"])
for ad in advertisers
if ad.get("type") != "Agent"
]
schools = [school["district"]["name"] for school in schools if school["district"].get("name")]
return {
"agents": agents if agents else None,
"brokers": brokers if brokers else None,
"schools": schools if schools else None,
"assessed_value": assessed_value if assessed_value else None,
"estimated_value": estimated_value if estimated_value else None,
}
@staticmethod
def _parse_neighborhoods(result: dict) -> Optional[str]:
neighborhoods_list = []
neighborhoods = result["location"].get("neighborhoods", [])
if neighborhoods:
for neighborhood in neighborhoods:
name = neighborhood.get("name")
if name:
neighborhoods_list.append(name)
return ", ".join(neighborhoods_list) if neighborhoods_list else None
@staticmethod
def handle_none_safely(address_part):
if address_part is None:
return ""
return address_part
@staticmethod
def _parse_address(result: dict, search_type):
if search_type == "general_search":
address = result["location"]["address"]
else:
address = result["address"]
return Address(
full_line=address.get("line"),
street=" ".join(
part
for part in [
address.get("street_number"),
address.get("street_direction"),
address.get("street_name"),
address.get("street_suffix"),
]
if part is not None
).strip(),
unit=address["unit"],
city=address["city"],
state=address["state_code"],
zip=address["postal_code"],
)
@staticmethod
def _parse_description(result: dict) -> Description:
description_data = result.get("description", {})
if description_data is None or not isinstance(description_data, dict):
description_data = {}
style = description_data.get("type", "")
if style is not None:
style = style.upper()
primary_photo = ""
if result and "primary_photo" in result:
primary_photo_info = result["primary_photo"]
if primary_photo_info and "href" in primary_photo_info:
primary_photo_href = primary_photo_info["href"]
primary_photo = primary_photo_href.replace("s.jpg", "od-w480_h360_x2.webp?w=1080&q=75")
return Description(
primary_photo=primary_photo,
alt_photos=RealtorScraper.process_alt_photos(result.get("photos", [])),
style=PropertyType.__getitem__(style) if style and style in PropertyType.__members__ else None,
beds=description_data.get("beds"),
baths_full=description_data.get("baths_full"),
baths_half=description_data.get("baths_half"),
sqft=description_data.get("sqft"),
lot_sqft=description_data.get("lot_sqft"),
sold_price=(
description_data.get("sold_price")
if result.get("last_sold_date") or result["list_price"] != description_data.get("sold_price")
else None
), #: has a sold date or list and sold price are different
year_built=description_data.get("year_built"),
garage=description_data.get("garage"),
stories=description_data.get("stories"),
text=description_data.get("text"),
)
@staticmethod
def calculate_days_on_mls(result: dict) -> Optional[int]:
list_date_str = result.get("list_date")
list_date = datetime.strptime(list_date_str.split("T")[0], "%Y-%m-%d") if list_date_str else None
last_sold_date_str = result.get("last_sold_date")
last_sold_date = datetime.strptime(last_sold_date_str, "%Y-%m-%d") if last_sold_date_str else None
today = datetime.now()
if list_date:
if result["status"] == "sold":
if last_sold_date:
days = (last_sold_date - list_date).days
if days >= 0:
return days
elif result["status"] in ("for_sale", "for_rent"):
days = (today - list_date).days
if days >= 0:
return days
@staticmethod
def process_alt_photos(photos_info):
try:
alt_photos = []
if photos_info:
for photo_info in photos_info:
href = photo_info.get("href", "")
alt_photo_href = href.replace("s.jpg", "od-w480_h360_x2.webp?w=1080&q=75")
alt_photos.append(alt_photo_href)
return alt_photos
except Exception:
pass
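
Review note: the new `get_bulk_prop_details` replaces one request per property with a single GraphQL document that aliases each `home` field as `home_<property_id>`, then strips the alias prefix from the response keys. The following is a minimal sketch of that idea, not the code in this commit. The `HOME_FRAGMENT` body, the `Home` type name, and the helper names `build_bulk_query` and `unalias_homes` are assumptions made for illustration; the sketch also quotes each ID with `json.dumps`, whereas the diff interpolates the ID unquoted.

```python
import json

# Hypothetical stand-in for the HOME_FRAGMENT constant imported from .queries;
# the real fragment body and its type name are not shown in this diff.
HOME_FRAGMENT = "fragment HomeData on Home { property_id }"


def build_bulk_query(property_ids):
    """Build one GraphQL document with an aliased `home` field per property."""
    # De-duplicate like the diff does; sorting only makes the output deterministic.
    unique_ids = sorted(set(property_ids))
    fields = "\n".join(
        # Assumption: quote the ID via json.dumps (the diff leaves it unquoted).
        f"home_{pid}: home(property_id: {json.dumps(pid)}) {{ ...HomeData }}"
        for pid in unique_ids
    )
    return f"{HOME_FRAGMENT}\nquery GetHomes {{\n{fields}\n}}"


def unalias_homes(response_json):
    """Map property_id -> details, dropping aliases that came back null."""
    homes = response_json.get("data") or {}
    return {
        alias[len("home_"):]: details
        for alias, details in homes.items()
        if alias.startswith("home_") and details
    }
```

Slicing off the prefix instead of calling `str.replace` only touches the leading `home_`, which may matter if an ID ever contained that substring; for purely numeric IDs the two approaches behave the same.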

File diff suppressed because it is too large
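
Review note: the `RealtorScraper` diff above replaces the hard-coded `range(200, min(total, 10000), 200)` with `DEFAULT_PAGE_SIZE` and `self.limit`, and trims each page with `properties_list[: self.limit - offset]`. A rough sketch of how those two pieces interact is shown here. The function names `remaining_offsets` and `trim_page` are hypothetical and exist only for this illustration.

```python
DEFAULT_PAGE_SIZE = 200  # same value as the class constant added in the diff


def remaining_offsets(total, limit, page_size=DEFAULT_PAGE_SIZE):
    """Offsets of the pages after the first one, capped by the caller's limit."""
    return list(range(page_size, min(total, limit), page_size))


def trim_page(results, limit, offset):
    """Keep only as many results from this page as the limit still allows."""
    return results[: limit - offset]


# With 1000 matches and a limit of 250, only one extra page is requested,
# and that page is cut down to the 50 results still allowed.
offsets = remaining_offsets(total=1000, limit=250)
last_page = trim_page(list(range(200)), limit=250, offset=offsets[-1])
```

This matches the comment in the diff: with an offset of 200 and a limit of 250, 50 results are kept.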

View File

@@ -0,0 +1,279 @@
"""
Parsers for realtor.com data processing
"""
from datetime import datetime
from typing import Optional
from ..models import Address, Description, PropertyType


def parse_open_houses(open_houses_data: list[dict] | None) -> list[dict] | None:
    """Parse open houses data and convert date strings to datetime objects"""
    if not open_houses_data:
        return None
    parsed_open_houses = []
    for oh in open_houses_data:
        parsed_oh = oh.copy()
        # Parse start_date and end_date
        if parsed_oh.get("start_date"):
            try:
                parsed_oh["start_date"] = datetime.fromisoformat(parsed_oh["start_date"].replace("Z", "+00:00"))
            except (ValueError, AttributeError):
                parsed_oh["start_date"] = None
        if parsed_oh.get("end_date"):
            try:
                parsed_oh["end_date"] = datetime.fromisoformat(parsed_oh["end_date"].replace("Z", "+00:00"))
            except (ValueError, AttributeError):
                parsed_oh["end_date"] = None
        parsed_open_houses.append(parsed_oh)
    return parsed_open_houses


def parse_units(units_data: list[dict] | None) -> list[dict] | None:
    """Parse units data and convert date strings to datetime objects"""
    if not units_data:
        return None
    parsed_units = []
    for unit in units_data:
        parsed_unit = unit.copy()
        # Parse availability date
        if parsed_unit.get("availability") and parsed_unit["availability"].get("date"):
            try:
                parsed_unit["availability"]["date"] = datetime.fromisoformat(parsed_unit["availability"]["date"].replace("Z", "+00:00"))
            except (ValueError, AttributeError):
                parsed_unit["availability"]["date"] = None
        parsed_units.append(parsed_unit)
    return parsed_units


def parse_tax_record(tax_record_data: dict | None) -> dict | None:
    """Parse tax record data and convert date strings to datetime objects"""
    if not tax_record_data:
        return None
    parsed_tax_record = tax_record_data.copy()
    # Parse last_update_date
    if parsed_tax_record.get("last_update_date"):
        try:
            parsed_tax_record["last_update_date"] = datetime.fromisoformat(parsed_tax_record["last_update_date"].replace("Z", "+00:00"))
        except (ValueError, AttributeError):
            parsed_tax_record["last_update_date"] = None
    return parsed_tax_record


def parse_current_estimates(estimates_data: list[dict] | None) -> list[dict] | None:
    """Parse current estimates data and convert date strings to datetime objects"""
    if not estimates_data:
        return None
    parsed_estimates = []
    for estimate in estimates_data:
        parsed_estimate = estimate.copy()
        # Parse date
        if parsed_estimate.get("date"):
            try:
                parsed_estimate["date"] = datetime.fromisoformat(parsed_estimate["date"].replace("Z", "+00:00"))
            except (ValueError, AttributeError):
                parsed_estimate["date"] = None
        # Parse source information
        if parsed_estimate.get("source"):
            source_data = parsed_estimate["source"]
            parsed_estimate["source"] = {
                "type": source_data.get("type"),
                "name": source_data.get("name")
            }
        parsed_estimates.append(parsed_estimate)
    return parsed_estimates


def parse_estimates(estimates_data: dict | None) -> dict | None:
    """Parse estimates data and convert date strings to datetime objects"""
    if not estimates_data:
        return None
    parsed_estimates = estimates_data.copy()
    # Parse current_values (which is aliased as currentValues in GraphQL)
    current_values = parsed_estimates.get("currentValues") or parsed_estimates.get("current_values")
    if current_values:
        parsed_current_values = []
        for estimate in current_values:
            parsed_estimate = estimate.copy()
            # Parse date
            if parsed_estimate.get("date"):
                try:
                    parsed_estimate["date"] = datetime.fromisoformat(parsed_estimate["date"].replace("Z", "+00:00"))
                except (ValueError, AttributeError):
                    parsed_estimate["date"] = None
            # Parse source information
            if parsed_estimate.get("source"):
                source_data = parsed_estimate["source"]
                parsed_estimate["source"] = {
                    "type": source_data.get("type"),
                    "name": source_data.get("name")
                }
            # Convert GraphQL aliases to Pydantic field names
            if "estimateHigh" in parsed_estimate:
                parsed_estimate["estimate_high"] = parsed_estimate.pop("estimateHigh")
            if "estimateLow" in parsed_estimate:
                parsed_estimate["estimate_low"] = parsed_estimate.pop("estimateLow")
            if "isBestHomeValue" in parsed_estimate:
                parsed_estimate["is_best_home_value"] = parsed_estimate.pop("isBestHomeValue")
            parsed_current_values.append(parsed_estimate)
        parsed_estimates["current_values"] = parsed_current_values
        # Remove the GraphQL alias if it exists
        if "currentValues" in parsed_estimates:
            del parsed_estimates["currentValues"]
    return parsed_estimates


def parse_neighborhoods(result: dict) -> Optional[str]:
    """Parse neighborhoods from location data"""
    neighborhoods_list = []
    neighborhoods = result["location"].get("neighborhoods", [])
    if neighborhoods:
        for neighborhood in neighborhoods:
            name = neighborhood.get("name")
            if name:
                neighborhoods_list.append(name)
    return ", ".join(neighborhoods_list) if neighborhoods_list else None


def handle_none_safely(address_part):
    """Handle None values safely for address parts"""
    if address_part is None:
        return ""
    return address_part


def parse_address(result: dict, search_type: str) -> Address:
    """Parse address data from result"""
    if search_type == "general_search":
        address = result["location"]["address"]
    else:
        address = result["address"]
    return Address(
        full_line=address.get("line"),
        street=" ".join(
            part
            for part in [
                address.get("street_number"),
                address.get("street_direction"),
                address.get("street_name"),
                address.get("street_suffix"),
            ]
            if part is not None
        ).strip(),
        unit=address["unit"],
        city=address["city"],
        state=address["state_code"],
        zip=address["postal_code"],
        # Additional address fields
        street_direction=address.get("street_direction"),
        street_number=address.get("street_number"),
        street_name=address.get("street_name"),
        street_suffix=address.get("street_suffix"),
    )


def parse_description(result: dict) -> Description | None:
    """Parse description data from result"""
    if not result:
        return None
    description_data = result.get("description", {})
    if description_data is None or not isinstance(description_data, dict):
        description_data = {}
    style = description_data.get("type", "")
    if style is not None:
        style = style.upper()
    primary_photo = None
    if (primary_photo_info := result.get("primary_photo")) and (
        primary_photo_href := primary_photo_info.get("href")
    ):
        primary_photo = primary_photo_href.replace("s.jpg", "od-w480_h360_x2.webp?w=1080&q=75")
    return Description(
        primary_photo=primary_photo,
        alt_photos=process_alt_photos(result.get("photos", [])),
        style=(PropertyType.__getitem__(style) if style and style in PropertyType.__members__ else None),
        beds=description_data.get("beds"),
        baths_full=description_data.get("baths_full"),
        baths_half=description_data.get("baths_half"),
        sqft=description_data.get("sqft"),
        lot_sqft=description_data.get("lot_sqft"),
        sold_price=(
            result.get("last_sold_price") or description_data.get("sold_price")
            if result.get("last_sold_date") or result["list_price"] != description_data.get("sold_price")
            else None
        ), #: has a sold date or list and sold price are different
        year_built=description_data.get("year_built"),
        garage=description_data.get("garage"),
        stories=description_data.get("stories"),
        text=description_data.get("text"),
        # Additional description fields
        name=description_data.get("name"),
        type=description_data.get("type"),
    )


def calculate_days_on_mls(result: dict) -> Optional[int]:
    """Calculate days on MLS from result data"""
    list_date_str = result.get("list_date")
    list_date = datetime.strptime(list_date_str.split("T")[0], "%Y-%m-%d") if list_date_str else None
    last_sold_date_str = result.get("last_sold_date")
    last_sold_date = datetime.strptime(last_sold_date_str, "%Y-%m-%d") if last_sold_date_str else None
    today = datetime.now()
    if list_date:
        if result["status"] == "sold":
            if last_sold_date:
                days = (last_sold_date - list_date).days
                if days >= 0:
                    return days
        elif result["status"] in ("for_sale", "for_rent"):
            days = (today - list_date).days
            if days >= 0:
                return days


def process_alt_photos(photos_info: list[dict]) -> list[str] | None:
    """Process alternative photos from photos info"""
    if not photos_info:
        return None
    return [
        photo_info["href"].replace("s.jpg", "od-w480_h360_x2.webp?w=1080&q=75")
        for photo_info in photos_info
        if photo_info.get("href")
    ]
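
Note on the date handling in the parsers above: each one swaps a trailing `Z` for `+00:00` before calling `datetime.fromisoformat`, then falls back to `None` on failure. The swap matters because `fromisoformat` accepts the `Z` suffix only from Python 3.11 onward, and this package declares support for Python 3.9. A minimal sketch of that repeated pattern as one helper follows. The name `parse_iso_datetime` is hypothetical and is not part of this change set.

```python
from datetime import datetime
from typing import Optional


def parse_iso_datetime(value: Optional[str]) -> Optional[datetime]:
    """Hypothetical helper: parse an ISO 8601 string, returning None on bad input."""
    if not value:
        return None
    try:
        # Rewrite the "Z" suffix as an explicit UTC offset for older Pythons.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except (ValueError, AttributeError):
        # ValueError: malformed string; AttributeError: value is not a str.
        return None


parsed = parse_iso_datetime("2025-07-15T20:56:16Z")  # timezone-aware datetime
missing = parse_iso_datetime(None)                    # None
invalid = parse_iso_datetime("not-a-date")            # None
```

If the parsers adopted a helper like this, each `try`/`except` block would shrink to a single assignment. Whether that refactor is worthwhile is a judgment call for the maintainers.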


@@ -0,0 +1,224 @@
"""
Processors for realtor.com property data processing
"""
from datetime import datetime
from typing import Optional
from ..models import (
    Property,
    ListingType,
    Agent,
    Broker,
    Builder,
    Advertisers,
    Office,
    ReturnType
)
from .parsers import (
    parse_open_houses,
    parse_units,
    parse_tax_record,
    parse_current_estimates,
    parse_estimates,
    parse_neighborhoods,
    parse_address,
    parse_description,
    calculate_days_on_mls,
    process_alt_photos
)


def process_advertisers(advertisers: list[dict] | None) -> Advertisers | None:
    """Process advertisers data from GraphQL response"""
    if not advertisers:
        return None

    def _parse_fulfillment_id(fulfillment_id: str | None) -> str | None:
        return fulfillment_id if fulfillment_id and fulfillment_id != "0" else None

    processed_advertisers = Advertisers()
    for advertiser in advertisers:
        advertiser_type = advertiser.get("type")
        if advertiser_type == "seller": #: agent
            processed_advertisers.agent = Agent(
                uuid=_parse_fulfillment_id(advertiser.get("fulfillment_id")),
                nrds_id=advertiser.get("nrds_id"),
                mls_set=advertiser.get("mls_set"),
                name=advertiser.get("name"),
                email=advertiser.get("email"),
                phones=advertiser.get("phones"),
                state_license=advertiser.get("state_license"),
            )
            if advertiser.get("broker") and advertiser["broker"].get("name"): #: has a broker
                processed_advertisers.broker = Broker(
                    uuid=_parse_fulfillment_id(advertiser["broker"].get("fulfillment_id")),
                    name=advertiser["broker"].get("name"),
                )
            if advertiser.get("office"): #: has an office
                processed_advertisers.office = Office(
                    uuid=_parse_fulfillment_id(advertiser["office"].get("fulfillment_id")),
                    mls_set=advertiser["office"].get("mls_set"),
                    name=advertiser["office"].get("name"),
                    email=advertiser["office"].get("email"),
                    phones=advertiser["office"].get("phones"),
                )
        if advertiser_type == "community": #: could be builder
            if advertiser.get("builder"):
                processed_advertisers.builder = Builder(
                    uuid=_parse_fulfillment_id(advertiser["builder"].get("fulfillment_id")),
                    name=advertiser["builder"].get("name"),
                )
    return processed_advertisers


def process_property(result: dict, mls_only: bool = False, extra_property_data: bool = False,
                     exclude_pending: bool = False, listing_type: ListingType = ListingType.FOR_SALE,
                     get_key_func=None, process_extra_property_details_func=None) -> Property | None:
    """Process property data from GraphQL response"""
    mls = result["source"].get("id") if "source" in result and isinstance(result["source"], dict) else None
    if not mls and mls_only:
        return None
    able_to_get_lat_long = (
        result
        and result.get("location")
        and result["location"].get("address")
        and result["location"]["address"].get("coordinate")
    )
    is_pending = result["flags"].get("is_pending")
    is_contingent = result["flags"].get("is_contingent")
    if (is_pending or is_contingent) and (exclude_pending and listing_type != ListingType.PENDING):
        return None
    property_id = result["property_id"]
    prop_details = process_extra_property_details_func(result) if extra_property_data and process_extra_property_details_func else {}
    property_estimates_root = result.get("current_estimates") or result.get("estimates", {}).get("currentValues")
    estimated_value = get_key_func(property_estimates_root, [0, "estimate"]) if get_key_func else None
    advertisers = process_advertisers(result.get("advertisers"))
    realty_property = Property(
        mls=mls,
        mls_id=(
            result["source"].get("listing_id")
            if "source" in result and isinstance(result["source"], dict)
            else None
        ),
        property_url=result["href"],
        property_id=property_id,
        listing_id=result.get("listing_id"),
        permalink=result.get("permalink"),
        status=("PENDING" if is_pending else "CONTINGENT" if is_contingent else result["status"].upper()),
        list_price=result["list_price"],
        list_price_min=result["list_price_min"],
        list_price_max=result["list_price_max"],
        list_date=(datetime.fromisoformat(result["list_date"].split("T")[0]) if result.get("list_date") else None),
        prc_sqft=result.get("price_per_sqft"),
        last_sold_date=(datetime.fromisoformat(result["last_sold_date"]) if result.get("last_sold_date") else None),
        pending_date=(datetime.fromisoformat(result["pending_date"].split("T")[0]) if result.get("pending_date") else None),
        new_construction=result["flags"].get("is_new_construction") is True,
        hoa_fee=(result["hoa"]["fee"] if result.get("hoa") and isinstance(result["hoa"], dict) else None),
        latitude=(result["location"]["address"]["coordinate"].get("lat") if able_to_get_lat_long else None),
        longitude=(result["location"]["address"]["coordinate"].get("lon") if able_to_get_lat_long else None),
        address=parse_address(result, search_type="general_search"),
        description=parse_description(result),
        neighborhoods=parse_neighborhoods(result),
        county=(result["location"]["county"].get("name") if result["location"]["county"] else None),
        fips_code=(result["location"]["county"].get("fips_code") if result["location"]["county"] else None),
        days_on_mls=calculate_days_on_mls(result),
        nearby_schools=prop_details.get("schools"),
        assessed_value=prop_details.get("assessed_value"),
        estimated_value=estimated_value if estimated_value else None,
        advertisers=advertisers,
        tax=prop_details.get("tax"),
        tax_history=prop_details.get("tax_history"),
        # Additional fields from GraphQL
        mls_status=result.get("mls_status"),
        last_sold_price=result.get("last_sold_price"),
        tags=result.get("tags"),
        details=result.get("details"),
        open_houses=parse_open_houses(result.get("open_houses")),
        pet_policy=result.get("pet_policy"),
        units=parse_units(result.get("units")),
        monthly_fees=result.get("monthly_fees"),
        one_time_fees=result.get("one_time_fees"),
        parking=result.get("parking"),
        terms=result.get("terms"),
        popularity=result.get("popularity"),
        tax_record=parse_tax_record(result.get("tax_record")),
        parcel_info=result.get("location", {}).get("parcel"),
        current_estimates=parse_current_estimates(result.get("current_estimates")),
        estimates=parse_estimates(result.get("estimates")),
        photos=result.get("photos"),
        flags=result.get("flags"),
    )
    return realty_property


def process_extra_property_details(result: dict, get_key_func=None) -> dict:
    """Process extra property details from GraphQL response"""
    if get_key_func:
        schools = get_key_func(result, ["nearbySchools", "schools"])
        assessed_value = get_key_func(result, ["taxHistory", 0, "assessment", "total"])
        tax_history = get_key_func(result, ["taxHistory"])
    else:
        nearby_schools = result.get("nearbySchools")
        schools = nearby_schools.get("schools", []) if nearby_schools else []
        tax_history_data = result.get("taxHistory", [])
        assessed_value = tax_history_data[0]["assessment"]["total"] if tax_history_data and tax_history_data[0].get("assessment", {}).get("total") else None
        tax_history = tax_history_data
    if schools:
        schools = [school["district"]["name"] for school in schools if school["district"].get("name")]
    # Process tax history
    latest_tax = None
    processed_tax_history = None
    if tax_history and isinstance(tax_history, list):
        tax_history = sorted(tax_history, key=lambda x: x.get("year", 0), reverse=True)
        if tax_history and "tax" in tax_history[0]:
            latest_tax = tax_history[0]["tax"]
        processed_tax_history = []
        for entry in tax_history:
            if "year" in entry and "tax" in entry:
                processed_entry = {
                    "year": entry["year"],
                    "tax": entry["tax"],
                }
                if "assessment" in entry and isinstance(entry["assessment"], dict):
                    processed_entry["assessment"] = {
                        "building": entry["assessment"].get("building"),
                        "land": entry["assessment"].get("land"),
                        "total": entry["assessment"].get("total"),
                    }
                processed_tax_history.append(processed_entry)
    return {
        "schools": schools if schools else None,
        "assessed_value": assessed_value if assessed_value else None,
        "tax": latest_tax,
        "tax_history": processed_tax_history,
    }


def get_key(data: dict, keys: list):
    """Get nested key from dictionary safely"""
    try:
        value = data
        for key in keys:
            value = value[key]
        return value or {}
    except (KeyError, TypeError, IndexError):
        return {}
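
Note on `get_key`: it walks a list of keys and indexes and returns `{}` whenever a step is missing. One consequence of the `value or {}` fallback is that falsy leaves such as `0` or an empty list also come back as `{}`, which is presumably why callers like `process_property` follow up with `estimated_value if estimated_value else None`. The snippet below copies the function from the diff and runs it against a made-up payload. The sample data is an assumption for illustration only.

```python
def get_key(data: dict, keys: list):
    """Get nested key from dictionary safely (copied from the diff above)."""
    try:
        value = data
        for key in keys:
            value = value[key]
        return value or {}
    except (KeyError, TypeError, IndexError):
        return {}


# Made-up payload shaped like the taxHistory alias used in the queries.
sample = {"taxHistory": [{"year": 2024, "tax": 0, "assessment": {"total": 310000}}]}

total = get_key(sample, ["taxHistory", 0, "assessment", "total"])  # 310000
missing = get_key(sample, ["taxHistory", 5, "assessment"])          # IndexError is swallowed
zero_tax = get_key(sample, ["taxHistory", 0, "tax"])                # falsy 0 becomes {}
no_data = get_key(None, ["taxHistory"])                             # TypeError is swallowed
```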


@@ -0,0 +1,300 @@
_SEARCH_HOMES_DATA_BASE = """{
pending_date
listing_id
property_id
href
permalink
list_date
status
mls_status
last_sold_price
last_sold_date
list_price
list_price_max
list_price_min
price_per_sqft
tags
open_houses {
start_date
end_date
description
time_zone
dst
href
methods
}
details {
category
text
parent_category
}
pet_policy {
cats
dogs
dogs_small
dogs_large
__typename
}
units {
availability {
date
__typename
}
description {
baths_consolidated
baths
beds
sqft
__typename
}
photos(https: true) {
title
href
tags {
label
}
}
list_price
__typename
}
flags {
is_contingent
is_pending
is_new_construction
}
description {
type
sqft
beds
baths_full
baths_half
lot_sqft
year_built
garage
type
name
stories
text
}
source {
id
listing_id
}
hoa {
fee
}
location {
address {
street_direction
street_number
street_name
street_suffix
line
unit
city
state_code
postal_code
coordinate {
lon
lat
}
}
county {
name
fips_code
}
neighborhoods {
name
}
}
tax_record {
cl_id
public_record_id
last_update_date
apn
tax_parcel_id
}
primary_photo(https: true) {
href
}
photos(https: true) {
title
href
tags {
label
}
}
advertisers {
email
broker {
name
fulfillment_id
}
type
name
fulfillment_id
builder {
name
fulfillment_id
}
phones {
ext
primary
type
number
}
office {
name
email
fulfillment_id
href
phones {
number
type
primary
ext
}
mls_set
}
corporation {
specialties
name
bio
href
fulfillment_id
}
mls_set
nrds_id
state_license
rental_corporation {
fulfillment_id
}
rental_management {
name
href
fulfillment_id
}
}
"""
HOME_FRAGMENT = """
fragment HomeData on Home {
property_id
nearbySchools: nearby_schools(radius: 5.0, limit_per_level: 3) {
__typename schools { district { __typename id name } }
}
popularity {
periods {
clicks_total
views_total
dwell_time_mean
dwell_time_median
leads_total
shares_total
saves_total
last_n_days
}
}
location {
parcel {
parcel_id
}
}
taxHistory: tax_history { __typename tax year assessment { __typename building land total } }
monthly_fees {
description
display_amount
}
one_time_fees {
description
display_amount
}
parking {
unassigned_space_rent
assigned_spaces_available
description
assigned_space_rent
}
terms {
text
category
}
}
"""
HOMES_DATA = """%s
nearbySchools: nearby_schools(radius: 5.0, limit_per_level: 3) {
__typename schools { district { __typename id name } }
}
monthly_fees {
description
display_amount
}
one_time_fees {
description
display_amount
}
popularity {
periods {
clicks_total
views_total
dwell_time_mean
dwell_time_median
leads_total
shares_total
saves_total
last_n_days
}
}
location {
parcel {
parcel_id
}
}
parking {
unassigned_space_rent
assigned_spaces_available
description
assigned_space_rent
}
terms {
text
category
}
taxHistory: tax_history { __typename tax year assessment { __typename building land total } }
estimates {
__typename
currentValues: current_values {
__typename
source { __typename type name }
estimate
estimateHigh: estimate_high
estimateLow: estimate_low
date
isBestHomeValue: isbest_homevalue
}
}
}""" % _SEARCH_HOMES_DATA_BASE
SEARCH_HOMES_DATA = """%s
current_estimates {
__typename
source {
__typename
type
name
}
estimate
estimateHigh: estimate_high
estimateLow: estimate_low
date
isBestHomeValue: isbest_homevalue
}
}""" % _SEARCH_HOMES_DATA_BASE
GENERAL_RESULTS_QUERY = """{
count
total
results %s
}""" % SEARCH_HOMES_DATA


@@ -1,16 +1,21 @@
from __future__ import annotations from __future__ import annotations
import pandas as pd import pandas as pd
from datetime import datetime from datetime import datetime
from .core.scrapers.models import Property, ListingType, Agent from .core.scrapers.models import Property, ListingType, Advertisers
from .exceptions import InvalidListingType, InvalidDate from .exceptions import InvalidListingType, InvalidDate
ordered_properties = [ ordered_properties = [
"property_url", "property_url",
"property_id",
"listing_id",
"permalink",
"mls", "mls",
"mls_id", "mls_id",
"status", "status",
"mls_status",
"text", "text",
"style", "style",
"formatted_address",
"full_street_line", "full_street_line",
"street", "street",
"unit", "unit",
@@ -24,11 +29,18 @@ ordered_properties = [
"year_built", "year_built",
"days_on_mls", "days_on_mls",
"list_price", "list_price",
"list_price_min",
"list_price_max",
"list_date", "list_date",
"pending_date",
"sold_price", "sold_price",
"last_sold_date", "last_sold_date",
"last_sold_price",
"assessed_value", "assessed_value",
"estimated_value", "estimated_value",
"tax",
"tax_history",
"new_construction",
"lot_sqft", "lot_sqft",
"price_per_sqft", "price_per_sqft",
"latitude", "latitude",
@@ -39,54 +51,92 @@ ordered_properties = [
"stories", "stories",
"hoa_fee", "hoa_fee",
"parking_garage", "parking_garage",
"agent", "agent_id",
"agent_name",
"agent_email", "agent_email",
"agent_phones", "agent_phones",
"broker", "agent_mls_set",
"broker_phone", "agent_nrds_id",
"broker_website", "broker_id",
"broker_name",
"builder_id",
"builder_name",
"office_id",
"office_mls_set",
"office_name",
"office_email",
"office_phones",
"nearby_schools", "nearby_schools",
"primary_photo", "primary_photo",
"alt_photos", "alt_photos"
] ]
def process_result(result: Property) -> pd.DataFrame: def process_result(result: Property) -> pd.DataFrame:
prop_data = {prop: None for prop in ordered_properties} prop_data = {prop: None for prop in ordered_properties}
prop_data.update(result.__dict__) prop_data.update(result.model_dump())
if "address" in prop_data: if "address" in prop_data and prop_data["address"]:
address_data = prop_data["address"] address_data = prop_data["address"]
prop_data["full_street_line"] = address_data.full_line prop_data["full_street_line"] = address_data.get("full_line")
prop_data["street"] = address_data.street prop_data["street"] = address_data.get("street")
prop_data["unit"] = address_data.unit prop_data["unit"] = address_data.get("unit")
prop_data["city"] = address_data.city prop_data["city"] = address_data.get("city")
prop_data["state"] = address_data.state prop_data["state"] = address_data.get("state")
prop_data["zip_code"] = address_data.zip prop_data["zip_code"] = address_data.get("zip")
prop_data["formatted_address"] = address_data.get("formatted_address")
if "agents" in prop_data: if "advertisers" in prop_data and prop_data.get("advertisers"):
agents: list[Agent] | None = prop_data["agents"] advertiser_data = prop_data["advertisers"]
if agents: if advertiser_data.get("agent"):
prop_data["agent"] = agents[0].name agent_data = advertiser_data["agent"]
prop_data["agent_email"] = agents[0].email prop_data["agent_id"] = agent_data.get("uuid")
prop_data["agent_phones"] = agents[0].phones prop_data["agent_name"] = agent_data.get("name")
prop_data["agent_email"] = agent_data.get("email")
prop_data["agent_phones"] = agent_data.get("phones")
prop_data["agent_mls_set"] = agent_data.get("mls_set")
prop_data["agent_nrds_id"] = agent_data.get("nrds_id")
if "brokers" in prop_data: if advertiser_data.get("broker"):
brokers = prop_data["brokers"] broker_data = advertiser_data["broker"]
if brokers: prop_data["broker_id"] = broker_data.get("uuid")
prop_data["broker"] = brokers[0].name prop_data["broker_name"] = broker_data.get("name")
prop_data["broker_phone"] = brokers[0].phone
prop_data["broker_website"] = brokers[0].website if advertiser_data.get("builder"):
builder_data = advertiser_data["builder"]
prop_data["builder_id"] = builder_data.get("uuid")
prop_data["builder_name"] = builder_data.get("name")
if advertiser_data.get("office"):
office_data = advertiser_data["office"]
prop_data["office_id"] = office_data.get("uuid")
prop_data["office_name"] = office_data.get("name")
prop_data["office_email"] = office_data.get("email")
prop_data["office_phones"] = office_data.get("phones")
prop_data["office_mls_set"] = office_data.get("mls_set")
prop_data["price_per_sqft"] = prop_data["prc_sqft"] prop_data["price_per_sqft"] = prop_data["prc_sqft"]
prop_data["nearby_schools"] = filter(None, prop_data["nearby_schools"]) if prop_data["nearby_schools"] else None prop_data["nearby_schools"] = filter(None, prop_data["nearby_schools"]) if prop_data["nearby_schools"] else None
prop_data["nearby_schools"] = ", ".join(set(prop_data["nearby_schools"])) if prop_data["nearby_schools"] else None prop_data["nearby_schools"] = ", ".join(set(prop_data["nearby_schools"])) if prop_data["nearby_schools"] else None
# Convert datetime objects to strings for CSV
for date_field in ["list_date", "pending_date", "last_sold_date"]:
if prop_data.get(date_field):
prop_data[date_field] = prop_data[date_field].strftime("%Y-%m-%d") if hasattr(prop_data[date_field], 'strftime') else prop_data[date_field]
# Convert HttpUrl objects to strings for CSV
if prop_data.get("property_url"):
prop_data["property_url"] = str(prop_data["property_url"])
description = result.description description = result.description
if description: if description:
prop_data["primary_photo"] = description.primary_photo prop_data["primary_photo"] = str(description.primary_photo) if description.primary_photo else None
prop_data["alt_photos"] = ", ".join(description.alt_photos) if description.alt_photos else None prop_data["alt_photos"] = ", ".join(str(url) for url in description.alt_photos) if description.alt_photos else None
prop_data["style"] = description.style if isinstance(description.style, str) else description.style.value if description.style else None prop_data["style"] = (
description.style
if isinstance(description.style, str)
else description.style.value if description.style else None
)
prop_data["beds"] = description.beds prop_data["beds"] = description.beds
prop_data["full_baths"] = description.baths_full prop_data["full_baths"] = description.baths_full
prop_data["half_baths"] = description.baths_half prop_data["half_baths"] = description.baths_half
@@ -110,7 +160,7 @@ def validate_input(listing_type: str) -> None:
def validate_dates(date_from: str | None, date_to: str | None) -> None: def validate_dates(date_from: str | None, date_to: str | None) -> None:
if (date_from is not None and date_to is None) or (date_from is None and date_to is not None): if isinstance(date_from, str) != isinstance(date_to, str):
raise InvalidDate("Both date_from and date_to must be provided.") raise InvalidDate("Both date_from and date_to must be provided.")
if date_from and date_to: if date_from and date_to:
@@ -122,3 +172,10 @@ def validate_dates(date_from: str | None, date_to: str | None) -> None:
raise InvalidDate("date_to must be after date_from.") raise InvalidDate("date_to must be after date_from.")
except ValueError: except ValueError:
raise InvalidDate(f"Invalid date format or range") raise InvalidDate(f"Invalid date format or range")
def validate_limit(limit: int) -> None:
#: 1 -> 10000 limit
if limit is not None and (limit < 1 or limit > 10000):
raise ValueError("Property limit must be between 1 and 10,000.")
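
Note on the validation changes: the rewritten check in `validate_dates` compares `isinstance(date_from, str) != isinstance(date_to, str)`, which acts as an exclusive-or and raises only when exactly one of the two bounds is a string. The sketch below isolates that check alongside the new `validate_limit`. `InvalidDate` is redefined locally as a stand-in for the class imported from `.exceptions`, and `check_date_pair` and `raises` are hypothetical names used only for this demonstration.

```python
from typing import Optional


class InvalidDate(Exception):
    """Local stand-in for the exception imported from .exceptions in the diff."""


def check_date_pair(date_from: Optional[str], date_to: Optional[str]) -> None:
    # The two isinstance results differ only when exactly one argument is a string.
    if isinstance(date_from, str) != isinstance(date_to, str):
        raise InvalidDate("Both date_from and date_to must be provided.")


def validate_limit(limit: Optional[int]) -> None:
    # Mirrors the diff: None is allowed, otherwise the range is 1..10000 inclusive.
    if limit is not None and (limit < 1 or limit > 10000):
        raise ValueError("Property limit must be between 1 and 10,000.")


def raises(exc_type, func, *args) -> bool:
    """Demonstration helper: report whether func(*args) raises exc_type."""
    try:
        func(*args)
    except exc_type:
        return True
    return False


one_sided = raises(InvalidDate, check_date_pair, "2025-01-01", None)  # True
both_missing = raises(InvalidDate, check_date_pair, None, None)       # False
too_large = raises(ValueError, validate_limit, 10001)                 # True
```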

1066
poetry.lock generated

File diff suppressed because it is too large


@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "homeharvest"
-version = "0.3.32"
+version = "0.5.0"
 description = "Real estate scraping library"
 authors = ["Zachary Hampton <zachary@bunsly.com>", "Cullen Watson <cullen@bunsly.com>"]
 homepage = "https://github.com/Bunsly/HomeHarvest"
@@ -11,9 +11,10 @@ homeharvest = "homeharvest.cli:main"
 [tool.poetry.dependencies]
 python = ">=3.9,<3.13"
-requests = "^2.31.0"
-pandas = "^2.1.1"
-pydantic = "^2.7.4"
+requests = "^2.32.4"
+pandas = "^2.3.1"
+pydantic = "^2.11.7"
+tenacity = "^9.1.2"
 [tool.poetry.group.dev.dependencies]


@@ -1,4 +1,5 @@
from homeharvest import scrape_property from homeharvest import scrape_property, Property
import pandas as pd
def test_realtor_pending_or_contingent(): def test_realtor_pending_or_contingent():
@@ -105,8 +106,12 @@ def test_realtor():
location="2530 Al Lipscomb Way", location="2530 Al Lipscomb Way",
listing_type="for_sale", listing_type="for_sale",
), ),
scrape_property(location="Phoenix, AZ", listing_type="for_rent"), #: does not support "city, state, USA" format scrape_property(
scrape_property(location="Dallas, TX", listing_type="sold"), #: does not support "city, state, USA" format location="Phoenix, AZ", listing_type="for_rent", limit=1000
), #: does not support "city, state, USA" format
scrape_property(
location="Dallas, TX", listing_type="sold", limit=1000
), #: does not support "city, state, USA" format
scrape_property(location="85281"), scrape_property(location="85281"),
] ]
@@ -114,10 +119,13 @@ def test_realtor():
def test_realtor_city(): def test_realtor_city():
results = scrape_property( results = scrape_property(location="Atlanta, GA", listing_type="for_sale", limit=1000)
location="Atlanta, GA",
listing_type="for_sale", assert results is not None and len(results) > 0
)
def test_realtor_land():
results = scrape_property(location="Atlanta, GA", listing_type="for_sale", property_type=["land"], limit=1000)
assert results is not None and len(results) > 0 assert results is not None and len(results) > 0
@@ -127,6 +135,7 @@ def test_realtor_bad_address():
location="abceefg ju098ot498hh9", location="abceefg ju098ot498hh9",
listing_type="for_sale", listing_type="for_sale",
) )
if len(bad_results) == 0: if len(bad_results) == 0:
assert True assert True
@@ -140,18 +149,23 @@ def test_realtor_foreclosed():
def test_realtor_agent(): def test_realtor_agent():
scraped = scrape_property(location="Detroit, MI", listing_type="for_sale") scraped = scrape_property(location="Detroit, MI", listing_type="for_sale", limit=1000, extra_property_data=False)
assert scraped["agent"].nunique() > 1 assert scraped["agent_name"].nunique() > 1
def test_realtor_without_extra_details(): def test_realtor_without_extra_details():
results = [ results = [
scrape_property( scrape_property(
location="15509 N 172nd Dr, Surprise, AZ 85388", location="00741",
listing_type="sold",
limit=10,
extra_property_data=False, extra_property_data=False,
), ),
scrape_property( scrape_property(
location="15509 N 172nd Dr, Surprise, AZ 85388", location="00741",
listing_type="sold",
limit=10,
extra_property_data=True,
), ),
] ]
@@ -182,6 +196,180 @@ def test_style_value_error():
location="Alaska, AK", location="Alaska, AK",
listing_type="sold", listing_type="sold",
extra_property_data=False, extra_property_data=False,
limit=1000,
) )
assert results is not None and len(results) > 0 assert results is not None and len(results) > 0
def test_primary_image_error():
results = scrape_property(
location="Spokane, PA",
listing_type="for_rent", # or (for_sale, for_rent, pending)
past_days=360,
radius=3,
extra_property_data=False,
)
assert results is not None and len(results) > 0
def test_limit():
over_limit = 876
extra_params = {"limit": over_limit}
over_results = scrape_property(
location="Waddell, AZ",
listing_type="for_sale",
**extra_params,
)
assert over_results is not None and len(over_results) <= over_limit
under_limit = 1
under_results = scrape_property(
location="Waddell, AZ",
listing_type="for_sale",
limit=under_limit,
)
assert under_results is not None and len(under_results) == under_limit
def test_apartment_list_price():
results = scrape_property(
location="Spokane, WA",
listing_type="for_rent", # or (for_sale, for_rent, pending)
extra_property_data=False,
)
assert results is not None
results = results[results["style"] == "APARTMENT"]
#: get percentage of results with atleast 1 of any column not none, list_price, list_price_min, list_price_max
assert (
len(results[results[["list_price", "list_price_min", "list_price_max"]].notnull().any(axis=1)]) / len(results)
> 0.5
)
def test_phone_number_matching():
searches = [
scrape_property(
location="Phoenix, AZ",
listing_type="for_sale",
limit=100,
),
scrape_property(
location="Phoenix, AZ",
listing_type="for_sale",
limit=100,
),
]
assert all([search is not None for search in searches])
#: random row
row = searches[0][searches[0]["agent_phones"].notnull()].sample()
#: find matching row
matching_row = searches[1].loc[searches[1]["property_url"] == row["property_url"].values[0]]
#: assert phone numbers are the same
assert row["agent_phones"].values[0] == matching_row["agent_phones"].values[0]
def test_return_type():
results = {
"pandas": [scrape_property(location="Surprise, AZ", listing_type="for_rent", limit=100)],
"pydantic": [scrape_property(location="Surprise, AZ", listing_type="for_rent", limit=100, return_type="pydantic")],
"raw": [
scrape_property(location="Surprise, AZ", listing_type="for_rent", limit=100, return_type="raw"),
scrape_property(location="66642", listing_type="for_rent", limit=100, return_type="raw"),
],
}
assert all(isinstance(result, pd.DataFrame) for result in results["pandas"])
assert all(isinstance(result[0], Property) for result in results["pydantic"])
assert all(isinstance(result[0], dict) for result in results["raw"])
def test_has_open_house():
    address_result = scrape_property("1 Hawthorne St Unit 12F, San Francisco, CA 94105", return_type="raw")
    assert address_result[0]["open_houses"] is not None  #: has open house data from address search

    zip_code_result = scrape_property("94105", return_type="raw")
    address_from_zip_result = list(filter(lambda row: row["property_id"] == "1264014746", zip_code_result))

    assert address_from_zip_result[0]["open_houses"] is not None  #: has open house data from general search
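

# A hedged sketch, not part of the original tests: indexing `[0]` on a filtered
# list raises IndexError when the listing is absent from the ZIP-code results.
# `find_by_property_id` is a hypothetical helper that returns None instead, so a
# test can fail with a clearer assertion message.
def find_by_property_id(rows, property_id):
    """Return the first raw row whose "property_id" matches, or None."""
    return next((row for row in rows if row.get("property_id") == property_id), None)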
def test_return_type_consistency():
    """Test that return_type works consistently between general and address searches"""

    # Test configurations - different search types
    test_locations = [
        ("Dallas, TX", "general"),  # General city search
        ("75201", "zip"),  # ZIP code search
        ("2530 Al Lipscomb Way", "address"),  # Address search
    ]

    for location, search_type in test_locations:
        # Test all return types for each search type
        pandas_result = scrape_property(
            location=location,
            listing_type="for_sale",
            limit=3,
            return_type="pandas",
        )

        pydantic_result = scrape_property(
            location=location,
            listing_type="for_sale",
            limit=3,
            return_type="pydantic",
        )

        raw_result = scrape_property(
            location=location,
            listing_type="for_sale",
            limit=3,
            return_type="raw",
        )

        # Validate pandas return type
        assert isinstance(pandas_result, pd.DataFrame), f"pandas result should be DataFrame for {search_type}"
        assert len(pandas_result) > 0, f"pandas result should not be empty for {search_type}"

        required_columns = ["property_id", "property_url", "list_price", "status", "formatted_address"]
        for col in required_columns:
            assert col in pandas_result.columns, f"Missing column {col} in pandas result for {search_type}"

        # Validate pydantic return type
        assert isinstance(pydantic_result, list), f"pydantic result should be list for {search_type}"
        assert len(pydantic_result) > 0, f"pydantic result should not be empty for {search_type}"

        for item in pydantic_result:
            assert isinstance(item, Property), f"pydantic items should be Property objects for {search_type}"
            assert item.property_id is not None, f"property_id should not be None for {search_type}"

        # Validate raw return type
        assert isinstance(raw_result, list), f"raw result should be list for {search_type}"
        assert len(raw_result) > 0, f"raw result should not be empty for {search_type}"

        for item in raw_result:
            assert isinstance(item, dict), f"raw items should be dict for {search_type}"
            assert "property_id" in item, f"raw items should have property_id for {search_type}"
            assert "href" in item, f"raw items should have href for {search_type}"

        # Cross-validate that different return types return related data
        pandas_ids = set(pandas_result["property_id"].tolist())
        pydantic_ids = set(prop.property_id for prop in pydantic_result)
        raw_ids = set(item["property_id"] for item in raw_result)

        # All return types should have some properties
        assert len(pandas_ids) > 0, f"pandas should return properties for {search_type}"
        assert len(pydantic_ids) > 0, f"pydantic should return properties for {search_type}"
        assert len(raw_ids) > 0, f"raw should return properties for {search_type}"