Compare commits

...

6 Commits

Author SHA1 Message Date
Zachary Hampton
79b2b648f5 Fix sold listings not included when listing_type=None (issue #142)
When listing_type=None, sold listings were excluded despite documentation stating all types should be returned. This fix includes two changes:

1. Explicitly include common listing types (for_sale, for_rent, sold, pending, off_market) when listing_type=None instead of sending empty status parameter
2. Fix or_filters logic to only apply for PENDING when not mixed with other types like SOLD, preventing unintended filtering

Updated README documentation to accurately reflect that None returns common listing types rather than all 8 types.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-14 13:30:54 -08:00
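The two changes described in this commit can be sketched in isolation. This is a minimal illustration, not the library's code: the `ListingType` enum below is a hypothetical stand-in, and `resolve_listing_types` and `use_or_filters` are names invented for this example.

```python
from enum import Enum


class ListingType(Enum):
    """Hypothetical stand-in for the library's ListingType enum."""
    FOR_SALE = "for_sale"
    FOR_RENT = "for_rent"
    SOLD = "sold"
    PENDING = "pending"
    OFF_MARKET = "off_market"


# Types requested when listing_type=None (the "common" types from the commit message).
COMMON_TYPES = [
    ListingType.FOR_SALE,
    ListingType.FOR_RENT,
    ListingType.SOLD,
    ListingType.PENDING,
    ListingType.OFF_MARKET,
]


def resolve_listing_types(listing_type):
    """Normalize None, a single type, or a list into a list of types."""
    if listing_type is None:
        return list(COMMON_TYPES)
    if isinstance(listing_type, list):
        return listing_type
    return [listing_type]


def use_or_filters(listing_types):
    """Use the pending/contingent or_filters only when no other type would be excluded."""
    has_pending = ListingType.PENDING in listing_types
    others = [t for t in listing_types if t not in (ListingType.PENDING, ListingType.FOR_SALE)]
    return has_pending and not others
```

With this rule, `[PENDING]` and `[PENDING, FOR_SALE]` keep the filter, while `[PENDING, SOLD]` and the `None` default drop it so that sold listings are not filtered out.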
Zachary Hampton
c2f01df1ad Add configurable parallel/sequential pagination with parallel parameter
- Add `parallel: bool = True` parameter to control pagination strategy
- Parallel mode (default): Fetches all pages in parallel for maximum speed
- Sequential mode: Fetches pages one-by-one with early termination checks
- Early termination stops pagination when time-based filters indicate no more matches
- Useful for rate limiting and narrow time windows
- Simplified pagination logic by removing hybrid first-page pre-check
- Updated README with usage example and parameter documentation
- Version bump to 0.8.4
- All 54 tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-13 10:36:47 -08:00
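The difference between the two pagination modes can be shown with a small, self-contained sketch. It assumes a hypothetical `fetch_page(offset)` callable standing in for one API request and a hypothetical `should_continue(items)` predicate; neither name comes from the library.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch_page, total, page_size, parallel=True, should_continue=None):
    """Collect items page by page.

    fetch_page(offset) -> list is a hypothetical callable for one request.
    should_continue(items) -> bool is only consulted in sequential mode.
    """
    items = list(fetch_page(0))  # the first page is always fetched
    offsets = range(page_size, total, page_size)

    if parallel:
        # Submit every remaining page at once; reading the futures in
        # submission order keeps the pages in offset order.
        with ThreadPoolExecutor() as executor:
            futures = [executor.submit(fetch_page, off) for off in offsets]
            for future in futures:
                items.extend(future.result())
        return items

    for off in offsets:
        # Sequential mode: check before each request whether more pages can match.
        if should_continue is not None and not should_continue(items):
            break
        items.extend(fetch_page(off))
    return items
```

Parallel mode always issues every request, so it is the faster choice when most pages are needed. Sequential mode trades latency for the chance to stop early, which helps when rate limits apply or when only the first page or two can match.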
Zachary Hampton
9b61a89c77 Fix timezone handling for all date parameters
- Treat naive datetimes as local time and convert to UTC automatically
- Support both naive and timezone-aware datetimes for updated_since, date_from, date_to
- Fix timezone comparison bug that caused incorrect filtering with naive datetimes
- Update documentation with clear timezone handling examples
- Add comprehensive timezone tests for naive and aware datetimes
- Bump version to 0.8.3

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-11 17:40:21 -08:00
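The conversion rule described in this commit (naive datetimes are local time, everything is normalized to UTC) can be sketched with the standard library alone. `to_utc_iso` is a hypothetical helper name used only for this illustration.

```python
from datetime import date, datetime, timedelta, timezone


def to_utc_iso(value):
    """Return an ISO 8601 string in UTC for a datetime or date."""
    if isinstance(value, datetime):  # check datetime first: it subclasses date
        # astimezone() on a naive datetime assumes the system's local timezone.
        return value.astimezone(timezone.utc).isoformat()
    if isinstance(value, date):
        midnight = datetime.combine(value, datetime.min.time())
        return midnight.replace(tzinfo=timezone.utc).isoformat()
    raise ValueError(f"Unsupported value: {value!r}")


pst = timezone(timedelta(hours=-8))
print(to_utc_iso(datetime(2025, 1, 20, 14, 30, tzinfo=pst)))  # 2025-01-20T22:30:00+00:00
print(to_utc_iso(date(2025, 1, 20)))                          # 2025-01-20T00:00:00+00:00
```

The result for a naive datetime depends on the machine's timezone setting, which is why timezone-aware inputs are the more predictable choice.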
Zachary Hampton
7065f8a0d4 Optimize time-based filtering with auto-sort and early termination
## Performance Optimizations

### Auto-Apply Optimal Sort
- Auto-apply `sort_by="last_update_date"` when using `updated_since` or `updated_in_past_hours`
- Auto-apply `sort_by="pending_date"` when using PENDING listings with date filters
- Ensures API returns properties in chronological order for efficient filtering
- Users can still override by specifying different `sort_by`

### Early Termination
- Pre-check page 1 before launching parallel pagination
- If last property is outside time window, stop pagination immediately
- Avoids 95%+ of unnecessary API calls for narrow time windows
- Only applies when conditions guarantee correctness (date sort + time filter)

## Impact
- 10x faster for narrow time windows (2-3 seconds vs 30+ seconds)
- Fixes inefficiency where 10,000 properties fetched to return 10 matches
- Maintains backward compatibility - falls back when optimization unavailable

## Changes
- homeharvest/__init__.py: Auto-sort logic for time filters
- homeharvest/core/scrapers/realtor/__init__.py: `_should_fetch_more_pages()` method + early termination in pagination
- tests/test_realtor.py: Tests for optimization behavior
- README.md: Updated parameters documentation with all 8 listing types

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-11 16:52:49 -08:00
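The early-termination condition relies on the results being sorted newest first: once the oldest item seen so far falls outside the time window, no later page can contain a match. A minimal sketch of that check follows, assuming a hypothetical record layout (dicts holding naive UTC datetimes) and a hypothetical function name `should_fetch_more`.

```python
from datetime import datetime, timedelta, timezone


def should_fetch_more(items_so_far, cutoff, date_key="last_update_date"):
    """Decide whether another page can still contain matches.

    Assumes items are sorted newest first and that dates are naive UTC
    datetimes stored under date_key (a hypothetical record layout).
    """
    if not items_so_far:
        return False  # nothing came back, so there is nothing further to fetch
    last_date = items_so_far[-1].get(date_key)
    if last_date is None:
        return True  # cannot decide without a date; keep going to stay correct
    # If the oldest item seen so far is still inside the window, later pages may match too.
    return last_date >= cutoff


now = datetime.now(timezone.utc).replace(tzinfo=None)
cutoff = now - timedelta(hours=2)
page = [
    {"last_update_date": now - timedelta(minutes=5)},
    {"last_update_date": now - timedelta(hours=3)},  # already outside the window
]
print(should_fetch_more(page, cutoff))  # False
```

Returning `True` whenever the date is missing keeps the check conservative: it may cost extra requests, but it never drops results that would have matched.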
Zachary Hampton
d88f781b47 - readme
2025-11-11 15:34:28 -08:00
Zachary Hampton
282064d8be - readme
2025-11-11 15:21:08 -08:00
7 changed files with 504 additions and 74 deletions

View File

@@ -84,7 +84,7 @@ properties = scrape_property(
#### Sorting & Listing Types
```py
# Sort options: list_price, list_date, sqft, beds, baths, last_update_date
# Listing types: "for_sale", "for_rent", "sold", "pending", list, or None (all)
# Listing types: "for_sale", "for_rent", "sold", "pending", "off_market", list, or None (common types)
properties = scrape_property(
location="Miami, FL",
listing_type=["for_sale", "pending"], # Single string, list, or None
@@ -94,6 +94,17 @@ properties = scrape_property(
)
```
#### Pagination Control
```py
# Sequential mode with early termination (more efficient for narrow filters)
properties = scrape_property(
location="Los Angeles, CA",
listing_type="for_sale",
updated_in_past_hours=2, # Narrow time window
parallel=False # Fetch pages sequentially, stop when filters no longer match
)
```
## Output
```plaintext
>>> properties.head()
@@ -129,30 +140,38 @@ for prop in properties[:5]:
```
Required
├── location (str): Flexible location search - accepts any of these formats:
- ZIP code: "92104"
- City: "San Diego" or "San Francisco"
- City, State (abbreviated or full): "San Diego, CA" or "San Diego, California"
- Full address: "1234 Main St, San Diego, CA 92104"
- Neighborhood: "Downtown San Diego"
- County: "San Diego County"
├── listing_type (option): Choose the type of listing.
- 'for_rent'
- 'for_sale'
- 'sold'
- 'pending' (for pending/contingent sales)
│ - ZIP code: "92104"
│ - City: "San Diego" or "San Francisco"
│ - City, State (abbreviated or full): "San Diego, CA" or "San Diego, California"
│ - Full address: "1234 Main St, San Diego, CA 92104"
│ - Neighborhood: "Downtown San Diego"
│ - County: "San Diego County"
│ - State (no support for abbreviated): "California"
├── listing_type (str | list[str] | None): Choose the type of listing.
│ - 'for_sale'
│ - 'for_rent'
│ - 'sold'
│ - 'pending'
│ - 'off_market'
│ - 'new_community'
│ - 'other'
│ - 'ready_to_build'
│ - List of strings returns properties matching ANY status: ['for_sale', 'pending']
│ - None returns common listing types (for_sale, for_rent, sold, pending, off_market)
Optional
├── property_type (list): Choose the type of properties.
- 'single_family'
- 'multi_family'
- 'condos'
- 'condo_townhome_rowhome_coop'
- 'condo_townhome'
- 'townhomes'
- 'duplex_triplex'
- 'farm'
- 'land'
- 'mobile'
│ - 'single_family'
│ - 'multi_family'
│ - 'condos'
│ - 'condo_townhome_rowhome_coop'
│ - 'condo_townhome'
│ - 'townhomes'
│ - 'duplex_triplex'
│ - 'farm'
│ - 'land'
│ - 'mobile'
├── return_type (option): Choose the return type.
│ - 'pandas' (default)
@@ -165,12 +184,12 @@ Optional
├── past_days (integer): Number of past days to filter properties. Utilizes 'last_sold_date' for 'sold' listing types, and 'list_date' for others (for_rent, for_sale).
│ Example: 30 (fetches properties listed/sold in the last 30 days)
├── past_hours (integer): Number of past hours to filter properties (more precise than past_days). Uses client-side filtering.
│ Example: 24 (fetches properties from the last 24 hours)
├── past_hours (integer | timedelta): Number of past hours to filter properties (more precise than past_days). Uses client-side filtering.
│ Example: 24 or timedelta(hours=24) (fetches properties from the last 24 hours)
│ Note: Cannot be used together with past_days or date_from/date_to
├── date_from, date_to (string): Start and end dates to filter properties listed or sold, both dates are required.
| (use this to get properties in chunks as there's a 10k result limit)
│ (use this to get properties in chunks as there's a 10k result limit)
│ Accepts multiple formats with automatic precision detection:
│ - Date strings: "YYYY-MM-DD" (day precision)
│ - Datetime strings: "YYYY-MM-DDTHH:MM:SS" (hour precision, uses client-side filtering)
@@ -180,6 +199,14 @@ Optional
│ Day precision: "2023-05-01", "2023-05-15"
│ Hour precision: "2025-01-20T09:00:00", "2025-01-20T17:00:00"
├── updated_since (datetime | str): Filter properties updated since a specific date/time (based on last_update_date field)
│ Accepts datetime objects or ISO 8601 strings
│ Example: updated_since=datetime(2025, 11, 10, 9, 0) or "2025-11-10T09:00:00"
├── updated_in_past_hours (integer | timedelta): Filter properties updated in the past X hours (based on last_update_date field)
│ Accepts integer (hours) or timedelta object
│ Example: updated_in_past_hours=24 or timedelta(hours=24)
├── beds_min, beds_max (integer): Filter by number of bedrooms
│ Example: beds_min=2, beds_max=4 (2-4 bedrooms)
@@ -199,7 +226,7 @@ Optional
│ Example: year_built_min=2000, year_built_max=2024 (built between 2000-2024)
├── sort_by (string): Sort results by field
│ Options: 'list_date', 'sold_date', 'list_price', 'sqft', 'beds', 'baths'
│ Options: 'list_date', 'sold_date', 'list_price', 'sqft', 'beds', 'baths', 'last_update_date'
│ Example: sort_by='list_price'
├── sort_direction (string): Sort direction, default is 'desc'
@@ -218,7 +245,9 @@ Optional
├── limit (integer): Limit the number of properties to fetch. Max & default is 10000.
└── offset (integer): Starting position for pagination within the 10k limit. Use with limit to fetch results in chunks.
├── offset (integer): Starting position for pagination within the 10k limit. Use with limit to fetch results in chunks.
└── parallel (True/False): Controls pagination strategy. Default is True (fetch pages in parallel for speed). Set to False for sequential fetching with early termination (useful for rate limiting or narrow time windows).
```
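The note about fetching in chunks to stay under the 10k result limit can be approached by splitting a long period into consecutive `date_from`/`date_to` windows. The helper below is a sketch with a hypothetical name (`date_windows`); it assumes both bounds are inclusive at day precision, which should be checked against actual results before relying on it.

```python
from datetime import date, timedelta


def date_windows(start, end, days):
    """Yield (date_from, date_to) ISO strings covering start..end inclusive, without overlap.

    days must be at least 1.
    """
    step = timedelta(days=days)
    current = start
    while current <= end:
        window_end = min(current + step - timedelta(days=1), end)
        yield current.isoformat(), window_end.isoformat()
        current = window_end + timedelta(days=1)


for date_from, date_to in date_windows(date(2023, 5, 1), date(2023, 5, 20), days=7):
    # Each pair could be passed on as scrape_property(..., date_from=date_from, date_to=date_to)
    print(date_from, date_to)
```

Smaller windows mean more requests but a lower chance that any single window hits the 10k cap; the right size depends on how busy the searched location is.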
### Property Schema
@@ -265,6 +294,7 @@ Property
│ ├── sold_price
│ ├── last_sold_date # datetime (full timestamp: YYYY-MM-DD HH:MM:SS)
│ ├── last_status_change_date # datetime (full timestamp: YYYY-MM-DD HH:MM:SS)
│ ├── last_update_date # datetime (full timestamp: YYYY-MM-DD HH:MM:SS)
│ ├── last_sold_price
│ ├── price_per_sqft
│ ├── new_construction

View File

@@ -48,6 +48,8 @@ def scrape_property(
# New sorting parameters
sort_by: str = None,
sort_direction: str = "desc",
# Pagination control
parallel: bool = True,
) -> Union[pd.DataFrame, list[dict], list[Property]]:
"""
Scrape properties from Realtor.com based on a given location and listing type.
@@ -72,6 +74,8 @@ def scrape_property(
- date objects: date(2025, 1, 20) (day-level precision)
- datetime objects: datetime(2025, 1, 20, 14, 30) (hour-level precision)
The precision is automatically detected based on the input format.
Timezone handling: Naive datetimes are treated as local time and automatically converted to UTC.
Timezone-aware datetimes are converted to UTC. For best results, use timezone-aware datetimes.
:param foreclosure: If set, fetches only foreclosure listings.
:param extra_property_data: Increases requests by O(n). If set, this fetches additional property data (e.g. agent, broker, property evaluations etc.)
:param exclude_pending: If true, this excludes pending or contingent properties from the results, unless listing type is pending.
@@ -80,7 +84,11 @@ def scrape_property(
New parameters:
:param past_hours: Get properties in the last _ hours (requires client-side filtering). Accepts int or timedelta.
:param updated_since: Filter by last_update_date (when property was last updated). Accepts datetime object or ISO 8601 string (client-side filtering)
:param updated_since: Filter by last_update_date (when property was last updated). Accepts datetime object or ISO 8601 string (client-side filtering).
Timezone handling: Naive datetimes (like datetime.now()) are treated as local time and automatically converted to UTC.
Timezone-aware datetimes are converted to UTC. Examples:
- datetime.now() - uses your local timezone
- datetime.now(timezone.utc) - uses UTC explicitly
:param updated_in_past_hours: Filter by properties updated in the last _ hours. Accepts int or timedelta (client-side filtering)
:param beds_min, beds_max: Filter by number of bedrooms
:param baths_min, baths_max: Filter by number of bathrooms
@@ -90,6 +98,9 @@ def scrape_property(
:param year_built_min, year_built_max: Filter by year built
:param sort_by: Sort results by field (list_date, sold_date, list_price, sqft, beds, baths, last_update_date)
:param sort_direction: Sort direction (asc, desc)
:param parallel: Controls pagination strategy. True (default) = fetch all pages in parallel for maximum speed.
False = fetch pages sequentially with early termination checks (useful for rate limiting or narrow time windows).
Sequential mode will stop paginating as soon as time-based filters indicate no more matches are possible.
Note: past_days and past_hours also accept timedelta objects for more Pythonic usage.
"""
@@ -129,6 +140,22 @@ def scrape_property(
converted_updated_since = convert_to_datetime_string(updated_since)
converted_updated_in_past_hours = extract_timedelta_hours(updated_in_past_hours)
# Auto-apply optimal sort for time-based filters (unless user specified different sort)
if (converted_updated_since or converted_updated_in_past_hours) and not sort_by:
sort_by = "last_update_date"
if not sort_direction:
sort_direction = "desc" # Most recent first
# Auto-apply optimal sort for PENDING listings with date filters
# PENDING API filtering is broken, so we rely on client-side filtering
# Sorting by pending_date ensures efficient pagination with early termination
elif (converted_listing_type == ListingType.PENDING and
(converted_past_days or converted_past_hours or converted_date_from) and
not sort_by):
sort_by = "pending_date"
if not sort_direction:
sort_direction = "desc" # Most recent first
scraper_input = ScraperInput(
location=location,
listing_type=converted_listing_type,
@@ -168,6 +195,8 @@ def scrape_property(
# New sorting
sort_by=sort_by,
sort_direction=sort_direction,
# Pagination control
parallel=parallel,
)
site = RealtorScraper(scraper_input)

View File

@@ -55,6 +55,9 @@ class ScraperInput(BaseModel):
sort_by: str | None = None
sort_direction: str = "desc"
# Pagination control
parallel: bool = True
class Scraper:
session = None
@@ -141,6 +144,9 @@ class Scraper:
self.sort_by = scraper_input.sort_by
self.sort_direction = scraper_input.sort_direction
# Pagination control
self.parallel = scraper_input.parallel
def search(self) -> list[Union[Property | dict]]: ...
@staticmethod

View File

@@ -144,7 +144,15 @@ class RealtorScraper(Scraper):
# Determine date field based on listing type
# Convert listing_type to list for uniform handling
if self.listing_type is None:
listing_types = []
# When None, return all common listing types as documented
# Note: NEW_COMMUNITY, OTHER, and READY_TO_BUILD are excluded as they typically return no results
listing_types = [
ListingType.FOR_SALE,
ListingType.FOR_RENT,
ListingType.SOLD,
ListingType.PENDING,
ListingType.OFF_MARKET,
]
date_field = None # When no listing_type is specified, skip date filtering
elif isinstance(self.listing_type, list):
listing_types = self.listing_type
@@ -277,10 +285,14 @@ class RealtorScraper(Scraper):
else:
sort_param = "" #: prioritize normal fractal sort from realtor
# Handle PENDING with or_filters (applies if PENDING is in the list or is the single type)
# Handle PENDING with or_filters
# Only use or_filters when PENDING is the only type or mixed only with FOR_SALE
# Using or_filters with other types (SOLD, FOR_RENT, etc.) will exclude those types
has_pending = ListingType.PENDING in listing_types
other_types = [lt for lt in listing_types if lt not in [ListingType.PENDING, ListingType.FOR_SALE]]
use_or_filters = has_pending and len(other_types) == 0
pending_or_contingent_param = (
"or_filters: { contingent: true, pending: true }" if has_pending else ""
"or_filters: { contingent: true, pending: true }" if use_or_filters else ""
)
# Build bucket parameter (only use fractal sort if no custom sort is specified)
@@ -526,31 +538,49 @@ class RealtorScraper(Scraper):
total = result["total"]
homes = result["properties"]
with ThreadPoolExecutor() as executor:
# Store futures with their offsets to maintain proper sort order
# Start from offset + page_size and go up to offset + limit
futures_with_offsets = [
(i, executor.submit(
self.general_search,
variables=search_variables | {"offset": i},
search_type=search_type,
))
for i in range(
# Fetch remaining pages based on parallel parameter
if self.offset + self.DEFAULT_PAGE_SIZE < min(total, self.offset + self.limit):
if self.parallel:
# Parallel mode: Fetch all remaining pages in parallel
with ThreadPoolExecutor() as executor:
futures_with_offsets = [
(i, executor.submit(
self.general_search,
variables=search_variables | {"offset": i},
search_type=search_type,
))
for i in range(
self.offset + self.DEFAULT_PAGE_SIZE,
min(total, self.offset + self.limit),
self.DEFAULT_PAGE_SIZE,
)
]
# Collect results and sort by offset to preserve API sort order
results = []
for offset, future in futures_with_offsets:
results.append((offset, future.result()["properties"]))
results.sort(key=lambda x: x[0])
for offset, properties in results:
homes.extend(properties)
else:
# Sequential mode: Fetch pages one by one with early termination checks
for current_offset in range(
self.offset + self.DEFAULT_PAGE_SIZE,
min(total, self.offset + self.limit),
self.DEFAULT_PAGE_SIZE,
)
]
):
# Check if we should continue based on time-based filters
if not self._should_fetch_more_pages(homes):
break
# Collect results and sort by offset to preserve API sort order across pages
results = []
for offset, future in futures_with_offsets:
results.append((offset, future.result()["properties"]))
# Sort by offset and concatenate in correct order
results.sort(key=lambda x: x[0])
for offset, properties in results:
homes.extend(properties)
result = self.general_search(
variables=search_variables | {"offset": current_offset},
search_type=search_type,
)
page_properties = result["properties"]
homes.extend(page_properties)
# Apply client-side hour-based filtering if needed
# (API only supports day-level filtering, so we post-filter for hour precision)
@@ -747,13 +777,14 @@ class RealtorScraper(Scraper):
if not homes:
return homes
from datetime import datetime, timedelta
from datetime import datetime, timedelta, timezone
# Determine date range for last_update_date filtering
date_range = None
if self.updated_in_past_hours:
cutoff_datetime = datetime.now() - timedelta(hours=self.updated_in_past_hours)
# Use UTC now, strip timezone to match naive property dates
cutoff_datetime = (datetime.now(timezone.utc) - timedelta(hours=self.updated_in_past_hours)).replace(tzinfo=None)
date_range = {'type': 'since', 'date': cutoff_datetime}
elif self.updated_since:
try:
@@ -784,15 +815,19 @@ class RealtorScraper(Scraper):
def _get_date_range(self):
"""Get the date range for filtering based on instance parameters."""
from datetime import datetime, timedelta
from datetime import datetime, timedelta, timezone
if self.last_x_days:
cutoff_date = datetime.now() - timedelta(days=self.last_x_days)
# Use UTC now, strip timezone to match naive property dates
cutoff_date = (datetime.now(timezone.utc) - timedelta(days=self.last_x_days)).replace(tzinfo=None)
return {'type': 'since', 'date': cutoff_date}
elif self.date_from and self.date_to:
try:
from_date = datetime.fromisoformat(self.date_from)
to_date = datetime.fromisoformat(self.date_to)
# Parse and strip timezone to match naive property dates
from_date_str = self.date_from.replace('Z', '+00:00') if self.date_from.endswith('Z') else self.date_from
to_date_str = self.date_to.replace('Z', '+00:00') if self.date_to.endswith('Z') else self.date_to
from_date = datetime.fromisoformat(from_date_str).replace(tzinfo=None)
to_date = datetime.fromisoformat(to_date_str).replace(tzinfo=None)
return {'type': 'range', 'from_date': from_date, 'to_date': to_date}
except ValueError:
return None
@@ -844,6 +879,74 @@ class RealtorScraper(Scraper):
return date_range['from_date'] <= date_obj <= date_range['to_date']
return False
def _should_fetch_more_pages(self, first_page):
"""Determine if we should continue pagination based on first page results.
This optimization prevents unnecessary API calls when using time-based filters
with date sorting. If the last property on page 1 is already outside the time
window, all future pages will also be outside (due to sort order).
Args:
first_page: List of properties from the first page
Returns:
bool: True if we should continue pagination, False to stop early
"""
from datetime import datetime, timedelta, timezone
# Check for last_update_date filters
if (self.updated_since or self.updated_in_past_hours) and self.sort_by == "last_update_date":
if not first_page:
return False
last_property = first_page[-1]
last_date = self._extract_date_from_home(last_property, 'last_update_date')
if not last_date:
return True
# Build date range for last_update_date filter
if self.updated_since:
try:
cutoff_datetime = datetime.fromisoformat(self.updated_since.replace('Z', '+00:00') if self.updated_since.endswith('Z') else self.updated_since)
# Strip timezone to match naive datetimes from _parse_date_value
cutoff_datetime = cutoff_datetime.replace(tzinfo=None)
date_range = {'type': 'since', 'date': cutoff_datetime}
except ValueError:
return True
elif self.updated_in_past_hours:
# Use UTC now, strip timezone to match naive property dates
cutoff_datetime = (datetime.now(timezone.utc) - timedelta(hours=self.updated_in_past_hours)).replace(tzinfo=None)
date_range = {'type': 'since', 'date': cutoff_datetime}
else:
return True
return self._is_datetime_in_range(last_date, date_range)
# Check for PENDING date filters
if (self.listing_type == ListingType.PENDING and
(self.last_x_days or self.past_hours or self.date_from) and
self.sort_by == "pending_date"):
if not first_page:
return False
last_property = first_page[-1]
last_date = self._extract_date_from_home(last_property, 'pending_date')
if not last_date:
return True
# Build date range for pending date filter
date_range = self._get_date_range()
if not date_range:
return True
return self._is_datetime_in_range(last_date, date_range)
# No optimization applicable, continue pagination
return True
def _apply_sort(self, homes):
"""Apply client-side sorting to ensure results are properly ordered.
@@ -862,6 +965,8 @@ class RealtorScraper(Scraper):
def get_sort_key(home):
"""Extract the sort field value from a home (handles both dict and Property object)."""
from datetime import datetime
if isinstance(home, dict):
value = home.get(self.sort_by)
else:
@@ -877,20 +982,23 @@ class RealtorScraper(Scraper):
if self.sort_by in ['list_date', 'sold_date', 'pending_date', 'last_update_date']:
if isinstance(value, str):
try:
from datetime import datetime
# Handle timezone indicators
date_value = value
if date_value.endswith('Z'):
date_value = date_value[:-1] + '+00:00'
parsed_date = datetime.fromisoformat(date_value)
return (0, parsed_date)
# Normalize to timezone-naive for consistent comparison
return 0, parsed_date.replace(tzinfo=None)
except (ValueError, AttributeError):
# If parsing fails, treat as None
return (1, 0) if self.sort_direction == "desc" else (1, float('inf'))
return (0, value)
# Handle datetime objects directly (normalize timezone)
if isinstance(value, datetime):
return 0, value.replace(tzinfo=None)
return 0, value
# For numeric fields, ensure we can compare
return (0, value)
return 0, value
# Sort the homes
reverse = (self.sort_direction == "desc")

View File

@@ -331,15 +331,26 @@ def validate_sort(sort_by: str | None, sort_direction: str | None = "desc") -> N
def convert_to_datetime_string(value) -> str | None:
"""
Convert datetime object or string to ISO 8601 string format.
Convert datetime object or string to ISO 8601 string format with UTC timezone.
Accepts:
- datetime.datetime objects
- datetime.date objects
- datetime.datetime objects (naive or timezone-aware)
- Naive datetimes are treated as local time and converted to UTC
- Timezone-aware datetimes are converted to UTC
- datetime.date objects (treated as midnight UTC)
- ISO 8601 strings (returned as-is)
- None (returns None)
Returns ISO 8601 formatted string or None.
Returns ISO 8601 formatted string with UTC timezone or None.
Examples:
>>> # Naive datetime (treated as local time)
>>> convert_to_datetime_string(datetime(2025, 1, 20, 14, 30))
'2025-01-20T22:30:00+00:00' # Assuming PST (UTC-8)
>>> # Timezone-aware datetime
>>> convert_to_datetime_string(datetime(2025, 1, 20, 14, 30, tzinfo=timezone.utc))
'2025-01-20T14:30:00+00:00'
"""
if value is None:
return None
@@ -349,13 +360,23 @@ def convert_to_datetime_string(value) -> str | None:
return value
# datetime.datetime object
from datetime import datetime, date
from datetime import datetime, date, timezone
if isinstance(value, datetime):
return value.isoformat()
# Handle naive datetime - treat as local time and convert to UTC
if value.tzinfo is None:
# Convert naive datetime to aware local time, then to UTC
local_aware = value.astimezone()
utc_aware = local_aware.astimezone(timezone.utc)
return utc_aware.isoformat()
else:
# Already timezone-aware, convert to UTC
utc_aware = value.astimezone(timezone.utc)
return utc_aware.isoformat()
# datetime.date object (convert to datetime at midnight)
# datetime.date object (convert to datetime at midnight UTC)
if isinstance(value, date):
return datetime.combine(value, datetime.min.time()).isoformat()
utc_datetime = datetime.combine(value, datetime.min.time()).replace(tzinfo=timezone.utc)
return utc_datetime.isoformat()
raise ValueError(
f"Invalid datetime value. Expected datetime object, date object, or ISO 8601 string. "

View File

@@ -1,6 +1,6 @@
[tool.poetry]
name = "homeharvest"
version = "0.8.0"
version = "0.8.5"
description = "Real estate scraping library"
authors = ["Zachary Hampton <zachary@bunsly.com>", "Cullen Watson <cullen@bunsly.com>"]
homepage = "https://github.com/ZacharyHampton/HomeHarvest"

View File

@@ -1,3 +1,5 @@
import pytz
from homeharvest import scrape_property, Property
import pandas as pd
@@ -1357,4 +1359,238 @@ def test_combined_filters_with_raw_data():
mls_id = source.get('id') if source else None
assert mls_id is not None and mls_id != "", \
f"Property {prop.get('property_id')} should have an MLS ID (source.id)"
f"Property {prop.get('property_id')} should have an MLS ID (source.id)"
def test_updated_since_filtering():
"""Test the updated_since parameter for filtering by last_update_date"""
from datetime import datetime, timedelta
# Test 1: Filter by last update in past 10 minutes (user's example)
cutoff_time = datetime.now() - timedelta(minutes=10)
result_10min = scrape_property(
location="California",
updated_since=cutoff_time,
sort_by="last_update_date",
sort_direction="desc",
limit=100
)
assert result_10min is not None
print(f"\n10-minute window returned {len(result_10min)} properties")
# Test 2: Verify all results have last_update_date within range
if len(result_10min) > 0:
for idx in range(min(10, len(result_10min))):
update_date_str = result_10min.iloc[idx]["last_update_date"]
if pd.notna(update_date_str):
try:
# Handle timezone-aware datetime strings
date_str = str(update_date_str)
if '+' in date_str or date_str.endswith('Z'):
# Remove timezone for comparison with naive cutoff_time
date_str = date_str.replace('+00:00', '').replace('Z', '')
update_date = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S")
assert update_date >= cutoff_time, \
f"Property last_update_date {update_date} should be >= {cutoff_time}"
print(f"Property {idx}: last_update_date = {update_date} (valid)")
except (ValueError, TypeError) as e:
print(f"Warning: Could not parse date {update_date_str}: {e}")
# Test 3: Compare different time windows
result_1hour = scrape_property(
location="California",
updated_since=datetime.now() - timedelta(hours=1),
limit=50
)
result_24hours = scrape_property(
location="California",
updated_since=datetime.now() - timedelta(hours=24),
limit=50
)
print(f"1-hour window: {len(result_1hour)} properties")
print(f"24-hour window: {len(result_24hours)} properties")
# Longer time window should return same or more results
if len(result_1hour) > 0 and len(result_24hours) > 0:
assert len(result_1hour) <= len(result_24hours), \
"1-hour filter should return <= 24-hour results"
# Test 4: Verify sorting works with filtering
if len(result_10min) > 1:
# Get non-null dates
dates = []
for idx in range(len(result_10min)):
date_str = result_10min.iloc[idx]["last_update_date"]
if pd.notna(date_str):
try:
# Handle timezone-aware datetime strings
clean_date_str = str(date_str)
if '+' in clean_date_str or clean_date_str.endswith('Z'):
clean_date_str = clean_date_str.replace('+00:00', '').replace('Z', '')
dates.append(datetime.strptime(clean_date_str, "%Y-%m-%d %H:%M:%S"))
except (ValueError, TypeError):
pass
if len(dates) > 1:
# Check if sorted descending
for i in range(len(dates) - 1):
assert dates[i] >= dates[i + 1], \
f"Results should be sorted by last_update_date descending: {dates[i]} >= {dates[i+1]}"
def test_updated_since_optimization():
"""Test that updated_since optimization works (auto-sort + early termination)"""
from datetime import datetime, timedelta
import time
# Test 1: Verify auto-sort is applied when using updated_since without explicit sort
start_time = time.time()
result = scrape_property(
location="California",
updated_since=datetime.now() - timedelta(minutes=5),
# NO sort_by specified - should auto-apply sort_by="last_update_date"
limit=50
)
elapsed_time = time.time() - start_time
print(f"\nAuto-sort test: {len(result)} properties in {elapsed_time:.2f}s")
# Should complete quickly due to early termination optimization (<5 seconds)
assert elapsed_time < 5.0, f"Query should be fast with optimization, took {elapsed_time:.2f}s"
# Verify results are sorted by last_update_date (proving auto-sort worked)
if len(result) > 1:
dates = []
for idx in range(min(10, len(result))):
date_str = result.iloc[idx]["last_update_date"]
if pd.notna(date_str):
try:
clean_date_str = str(date_str)
if '+' in clean_date_str or clean_date_str.endswith('Z'):
clean_date_str = clean_date_str.replace('+00:00', '').replace('Z', '')
dates.append(datetime.strptime(clean_date_str, "%Y-%m-%d %H:%M:%S"))
except (ValueError, TypeError):
pass
if len(dates) > 1:
# Verify descending order (most recent first)
for i in range(len(dates) - 1):
assert dates[i] >= dates[i + 1], \
"Auto-applied sort should order by last_update_date descending"
print("Auto-sort optimization verified ✓")
def test_pending_date_optimization():
"""Test that PENDING + date filters get auto-sort and early termination"""
from datetime import datetime, timedelta
import time
# Test: Verify auto-sort is applied for PENDING with past_days
start_time = time.time()
result = scrape_property(
location="California",
listing_type="pending",
past_days=7,
# NO sort_by specified - should auto-apply sort_by="pending_date"
limit=50
)
elapsed_time = time.time() - start_time
print(f"\nPENDING auto-sort test: {len(result)} properties in {elapsed_time:.2f}s")
# Should complete quickly due to optimization (<10 seconds)
assert elapsed_time < 10.0, f"PENDING query should be fast with optimization, took {elapsed_time:.2f}s"
# Verify results are sorted by pending_date (proving auto-sort worked)
if len(result) > 1:
dates = []
for idx in range(min(10, len(result))):
date_str = result.iloc[idx]["pending_date"]
if pd.notna(date_str):
try:
clean_date_str = str(date_str)
if '+' in clean_date_str or clean_date_str.endswith('Z'):
clean_date_str = clean_date_str.replace('+00:00', '').replace('Z', '')
dates.append(datetime.strptime(clean_date_str, "%Y-%m-%d %H:%M:%S"))
except (ValueError, TypeError):
pass
if len(dates) > 1:
# Verify descending order (most recent first)
for i in range(len(dates) - 1):
assert dates[i] >= dates[i + 1], \
"PENDING auto-applied sort should order by pending_date descending"
print("PENDING optimization verified ✓")
def test_basic_last_update_date():
from datetime import datetime, timedelta
# Test with naive datetime (treated as local time)
now = datetime.now()
properties = scrape_property(
"California",
updated_since=now - timedelta(minutes=10),
sort_by="last_update_date",
sort_direction="desc"
)
# Convert now to timezone-aware for comparison with UTC dates in DataFrame
now_utc = now.astimezone(tz=pytz.timezone("UTC"))
# Check all last_update_date values are <= now
assert (properties["last_update_date"] <= now_utc).all()
# Verify we got some results
assert len(properties) > 0
def test_timezone_aware_last_update_date():
"""Test that timezone-aware datetimes work correctly for updated_since"""
from datetime import datetime, timedelta, timezone
# Test with timezone-aware datetime (explicit UTC)
now_utc = datetime.now(timezone.utc)
properties = scrape_property(
"California",
updated_since=now_utc - timedelta(minutes=10),
sort_by="last_update_date",
sort_direction="desc"
)
# Check all last_update_date values are <= now
assert (properties["last_update_date"] <= now_utc).all()
# Verify we got some results
assert len(properties) > 0
def test_timezone_handling_date_range():
"""Test timezone handling for date_from and date_to parameters"""
from datetime import datetime, timedelta
# Test with naive datetimes for date range (PENDING properties)
now = datetime.now()
three_days_ago = now - timedelta(days=3)
properties = scrape_property(
"California",
listing_type="pending",
date_from=three_days_ago,
date_to=now
)
# Verify we got results and they're within the date range
if len(properties) > 0:
# Convert now to UTC for comparison
now_utc = now.astimezone(tz=pytz.timezone("UTC"))
assert (properties["pending_date"] <= now_utc).all()