Compare commits

...

26 Commits

Author SHA1 Message Date
Cullen Watson    19f23c95c4  Merge pull request #43 from Bunsly/add_photos (Add photos)  2023-11-24 21:40:34 -06:00
Cullen           4676ec9839  chore: remove test file  2023-11-24 13:42:52 -06:00
Cullen           6dd0b058d3  chore: version  2023-11-24 13:41:46 -06:00
Cullen           a74c1a9950  enh: add photos  2023-11-24 13:40:57 -06:00
Cullen Watson    fa507dbc72  docs: typo  2023-11-20 01:05:10 -06:00
Cullen Watson    5b6a9943cc  Merge pull request #42 from Bunsly/street_dirction (fix: add street direction)  2023-11-08 16:53:29 -06:00
Cullen Watson    9816defaf3  chore: version  2023-11-08 16:53:05 -06:00
Cullen Watson    f692b438b2  fix: add street direction  2023-11-08 16:52:06 -06:00
Zachary Hampton  30f48f54c8  Update README.md  2023-11-06 22:13:01 -07:00
Cullen Watson    7f86f69610  docs: readme  2023-11-03 18:53:46 -05:00
Cullen Watson    cc64dacdb0  docs: readme - date_from, date_to  2023-11-03 18:52:22 -05:00
Cullen Watson    d3268d8e5a  Merge pull request #40 from Bunsly/date_range (Add date_to and date_from params)  2023-11-03 18:42:13 -05:00
Cullen Watson    4edad901c5  [enh] date_to and date_from  2023-11-03 18:40:34 -05:00
Zachary Hampton  c597a78191  - None address bug fix  2023-10-18 16:32:43 -07:00
Zachary Hampton  11a7d854f0  - remove pending listings from for_sale  2023-10-18 14:41:41 -07:00
Zachary Hampton  f726548cc6  Update pyproject.toml  2023-10-18 09:35:48 -07:00
Zachary Hampton  fad7d670eb  Update README.md  2023-10-18 08:37:42 -07:00
Zachary Hampton  89a6f93c9f  Update pyproject.toml  2023-10-18 08:37:26 -07:00
Zachary Hampton  e1090b06e4  Update README.md  2023-10-17 20:22:25 -07:00
Cullen Watson    5036e74b60  Merge branch 'master' of https://github.com/ZacharyHampton/HomeHarvest  2023-10-09 11:30:17 -05:00
Cullen Watson    2cb544bc8d  [chore] display clickable URLs in jupyter  2023-10-09 11:28:56 -05:00
Zachary Hampton  68cb365e03  Merge pull request #34 from ZacharyHampton/days_on_mls ([enh] days_on_mls attr)  2023-10-09 09:04:59 -07:00
Cullen Watson    23876d5725  [chore] function types  2023-10-09 11:02:51 -05:00
Cullen Watson    b59d55f6b5  [enh] days_on_mls attr  2023-10-09 11:00:36 -05:00
Cullen Watson    3c3adb5f29  [docs] update video  2023-10-05 20:24:23 -05:00
Zachary Hampton  6ede8622cc  - pending listing support; - removal of pending_or_contingent param  2023-10-05 11:43:00 -07:00
11 changed files with 333 additions and 131 deletions

View File

@@ -6,9 +6,9 @@
 **Not technical?** Try out the web scraping tool on our site at [tryhomeharvest.com](https://tryhomeharvest.com).
-*Looking to build a data-focused software product?* **[Book a call](https://calendly.com/zachary-products/15min)** *to work with us.*
+*Looking to build a data-focused software product?* **[Book a call](https://bunsly.com)** *to work with us.*
-Check out another project we wrote: ***[JobSpy](https://github.com/cullenwatson/JobSpy)** a Python package for job scraping*
+Check out another project we wrote: ***[JobSpy](https://github.com/Bunsly/JobSpy)** a Python package for job scraping*
 ## HomeHarvest Features
@@ -20,7 +20,7 @@ Check out another project we wrote: ***[JobSpy](https://github.com/cullenwatson/
 - **CLI**: For users who prefer command-line operations.
-[Video Guide for HomeHarvest](https://youtu.be/JnV7eR2Ve2o) - _updated for release v0.2.7_
+[Video Guide for HomeHarvest](https://youtu.be/J1qgNPgmSLI) - _updated for release v0.3.4_
 ![homeharvest](https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/b3d5d727-e67b-4a9f-85d8-1e65fd18620a)
@@ -45,9 +45,12 @@ filename = f"HomeHarvest_{current_timestamp}.csv"
 properties = scrape_property(
   location="San Diego, CA",
-  listing_type="sold",  # or (for_sale, for_rent)
+  listing_type="sold",  # or (for_sale, for_rent, pending)
-  past_days=30,  # sold in last 30 days - listed in last x days if (for_sale, for_rent)
+  past_days=30,  # sold in last 30 days - listed in last 30 days if (for_sale, for_rent)
-  # pending_or_contingent=True # use on for_sale listings to find pending / contingent listings
+  # date_from="2023-05-01", # alternative to past_days
+  # date_to="2023-05-28",
   # mls_only=True,  # only fetch MLS listings
   # proxy="http://user:pass@host:port"  # use a proxy to change your IP address
 )
@@ -58,37 +61,6 @@ properties.to_csv(filename, index=False)
 print(properties.head())
 ```
-### CLI
-```
-usage: homeharvest [-l {for_sale,for_rent,sold}] [-o {excel,csv}] [-f FILENAME] [-p PROXY] [-d DAYS] [-r RADIUS] [-m] [-c] location
-Home Harvest Property Scraper
-positional arguments:
-  location              Location to scrape (e.g., San Francisco, CA)
-options:
-  -l {for_sale,for_rent,sold}, --listing_type {for_sale,for_rent,sold}
-                        Listing type to scrape
-  -o {excel,csv}, --output {excel,csv}
-                        Output format
-  -f FILENAME, --filename FILENAME
-                        Name of the output file (without extension)
-  -p PROXY, --proxy PROXY
-                        Proxy to use for scraping
-  -d DAYS, --days DAYS  Sold/listed in last _ days filter.
-  -r RADIUS, --radius RADIUS
-                        Get comparable properties within _ (e.g., 0.0) miles. Only applicable for individual addresses.
-  -m, --mls_only        If set, fetches only MLS listings.
-  -c, --pending_or_contingent
-                        If set, fetches only pending or contingent listings. Only applicable for for_sale listings from general area searches.
-```
-```bash
-homeharvest "San Francisco, CA" -l for_rent -o excel -f HomeHarvest
-```
 ## Output
 ```plaintext
@@ -110,6 +82,7 @@ Required
   - 'for_rent'
   - 'for_sale'
   - 'sold'
+  - 'pending'
 Optional
 ├── radius (decimal): Radius in miles to find comparable properties based on individual addresses.
@@ -117,14 +90,47 @@ Optional
 ├── past_days (integer): Number of past days to filter properties. Utilizes 'last_sold_date' for 'sold' listing types, and 'list_date' for others (for_rent, for_sale).
 │    Example: 30 (fetches properties listed/sold in the last 30 days)
 |
-├── pending_or_contingent (True/False): If set, fetches only pending or contingent listings. Only applicable for `for_sale listings` from general area searches.
+├── date_from, date_to (string): Start and end dates to filter properties listed or sold, both dates are required.
+|    (use this to get properties in chunks as there's a 10k result limit)
+│    Format for both must be "YYYY-MM-DD".
+│    Example: "2023-05-01", "2023-05-15" (fetches properties listed/sold between these dates)
 ├── mls_only (True/False): If set, fetches only MLS listings (mainly applicable to 'sold' listings)
 └── proxy (string): In format 'http://user:pass@host:port'
 ```
+### CLI
+```
+usage: homeharvest [-l {for_sale,for_rent,sold}] [-o {excel,csv}] [-f FILENAME] [-p PROXY] [-d DAYS] [-r RADIUS] [-m] [-c] location
+Home Harvest Property Scraper
+positional arguments:
+  location              Location to scrape (e.g., San Francisco, CA)
+options:
+  -l {for_sale,for_rent,sold,pending}, --listing_type {for_sale,for_rent,sold,pending}
+                        Listing type to scrape
+  -o {excel,csv}, --output {excel,csv}
+                        Output format
+  -f FILENAME, --filename FILENAME
+                        Name of the output file (without extension)
+  -p PROXY, --proxy PROXY
+                        Proxy to use for scraping
+  -d DAYS, --days DAYS  Sold/listed in last _ days filter.
+  -r RADIUS, --radius RADIUS
+                        Get comparable properties within _ (e.g., 0.0) miles. Only applicable for individual addresses.
+  -m, --mls_only        If set, fetches only MLS listings.
+```
+```bash
+homeharvest "San Francisco, CA" -l for_rent -o excel -f HomeHarvest
+```
 ### Property Schema
 ```plaintext
 Property
@@ -152,6 +158,7 @@ Property
 │ └── lot_sqft
 ├── Property Listing Details:
+│ ├── days_on_mls
 │ ├── list_price
 │ ├── list_date
 │ ├── sold_price
@@ -171,7 +178,7 @@ Property
 The following exceptions may be raised when using HomeHarvest:
 - `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold`
-- `NoResultsFound` - no properties found from your search
+- `InvalidDate` - date_from or date_to is not in the format YYYY-MM-DD
 ## Frequently Asked Questions
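The new `date_from`/`date_to` docs above note a 10k-result cap per search and suggest fetching in chunks. A minimal sketch of how a caller might split a long window into smaller date ranges (the `date_chunks` helper is hypothetical, not part of HomeHarvest):

```python
from datetime import date, timedelta

def date_chunks(start: str, end: str, days: int = 14):
    """Split [start, end] into consecutive windows of at most `days` days,
    yielded as (date_from, date_to) strings in YYYY-MM-DD format."""
    cur = date.fromisoformat(start)
    stop = date.fromisoformat(end)
    while cur <= stop:
        window_end = min(cur + timedelta(days=days - 1), stop)
        yield cur.isoformat(), window_end.isoformat()
        cur = window_end + timedelta(days=1)

# each window can be passed as date_from/date_to in a separate scrape_property call
chunks = list(date_chunks("2023-05-01", "2023-05-28", days=14))
```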

View File

@@ -4,7 +4,9 @@
   "cell_type": "code",
   "execution_count": null,
   "id": "cb48903e-5021-49fe-9688-45cd0bc05d0f",
-  "metadata": {},
+  "metadata": {
+   "is_executing": true
+  },
   "outputs": [],
   "source": [
    "from homeharvest import scrape_property\n",
@@ -84,10 +86,34 @@
   "outputs": [],
   "source": [
    "# check sold properties\n",
-   "scrape_property(\n",
+   "properties = scrape_property(\n",
    "    location=\"90210\",\n",
-   "    listing_type=\"sold\"\n",
+   "    listing_type=\"sold\",\n",
-   ")"
+   "    past_days=10\n",
+   ")\n",
+   "display(properties)"
+  ]
+ },
+ {
+  "cell_type": "code",
+  "execution_count": null,
+  "id": "628c1ce2",
+  "metadata": {
+   "collapsed": false,
+   "is_executing": true,
+   "jupyter": {
+    "outputs_hidden": false
+   }
+  },
+  "outputs": [],
+  "source": [
+   "# display clickable URLs\n",
+   "from IPython.display import display, HTML\n",
+   "properties['property_url'] = '<a href=\"' + properties['property_url'] + '\" target=\"_blank\">' + properties['property_url'] + '</a>'\n",
+   "\n",
+   "html = properties.to_html(escape=False)\n",
+   "truncate_width = f'<style>.dataframe td {{ max-width: 200px; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; }}</style>{html}'\n",
+   "display(HTML(truncate_width))"
   ]
  }
 ],
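The clickable-URL cell added to the notebook hinges on `DataFrame.to_html(escape=False)`, which leaves the injected `<a>` tags intact. A sketch with a stand-in DataFrame (the URL below is illustrative, not real scrape output):

```python
import pandas as pd

# stand-in for scrape_property output
properties = pd.DataFrame(
    {"property_url": ["https://www.realtor.com/realestateandhomes-detail/123"]}
)

# wrap each URL in an anchor tag so it renders as a link in Jupyter
properties["property_url"] = (
    '<a href="' + properties["property_url"] + '" target="_blank">'
    + properties["property_url"] + "</a>"
)

# escape=False keeps the <a> tags from being HTML-escaped
html = properties.to_html(escape=False)
```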

View File

@@ -1,10 +1,9 @@
 import warnings
 import pandas as pd
 from .core.scrapers import ScraperInput
-from .utils import process_result, ordered_properties, validate_input
+from .utils import process_result, ordered_properties, validate_input, validate_dates
 from .core.scrapers.realtor import RealtorScraper
 from .core.scrapers.models import ListingType
-from .exceptions import InvalidListingType, NoResultsFound
 def scrape_property(
@@ -13,8 +12,9 @@ def scrape_property(
     radius: float = None,
     mls_only: bool = False,
     past_days: int = None,
-    pending_or_contingent: bool = False,
     proxy: str = None,
+    date_from: str = None,
+    date_to: str = None,
 ) -> pd.DataFrame:
     """
     Scrape properties from Realtor.com based on a given location and listing type.
@@ -23,10 +23,11 @@ def scrape_property(
     :param radius: Get properties within _ (e.g. 1.0) miles. Only applicable for individual addresses.
     :param mls_only: If set, fetches only listings with MLS IDs.
     :param past_days: Get properties sold or listed (dependent on your listing_type) in the last _ days.
-    :param pending_or_contingent: If set, fetches only pending or contingent listings. Only applicable for for_sale listings from general area searches.
+    :param date_from, date_to: Get properties sold or listed (dependent on your listing_type) between these dates. format: 2021-01-28
     :param proxy: Proxy to use for scraping
     """
     validate_input(listing_type)
+    validate_dates(date_from, date_to)
     scraper_input = ScraperInput(
         location=location,
@@ -35,7 +36,8 @@ def scrape_property(
         radius=radius,
         mls_only=mls_only,
         last_x_days=past_days,
-        pending_or_contingent=pending_or_contingent,
+        date_from=date_from,
+        date_to=date_to,
     )
     site = RealtorScraper(scraper_input)
@@ -43,7 +45,7 @@ def scrape_property(
     properties_dfs = [process_result(result) for result in results]
     if not properties_dfs:
-        raise NoResultsFound("no results found for the query")
+        return pd.DataFrame()
     with warnings.catch_warnings():
         warnings.simplefilter("ignore", category=FutureWarning)
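With this change an empty search returns an empty DataFrame instead of raising `NoResultsFound`. A sketch of the caller-facing contract (`fetch_properties` is a hypothetical stand-in for `scrape_property`, with no network access):

```python
import pandas as pd

def fetch_properties(raw_results: list) -> pd.DataFrame:
    # mirrors the new behavior: no matches -> empty DataFrame, not an exception
    if not raw_results:
        return pd.DataFrame()
    return pd.DataFrame(raw_results)

# callers now branch on df.empty instead of catching NoResultsFound
empty_df = fetch_properties([])
full_df = fetch_properties([{"list_price": 100}])
```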

View File

@@ -14,7 +14,7 @@ def main():
         "--listing_type",
         type=str,
         default="for_sale",
-        choices=["for_sale", "for_rent", "sold"],
+        choices=["for_sale", "for_rent", "sold", "pending"],
         help="Listing type to scrape",
     )
@@ -60,13 +60,6 @@ def main():
         help="If set, fetches only MLS listings.",
     )
-    parser.add_argument(
-        "-c",
-        "--pending_or_contingent",
-        action="store_true",
-        help="If set, fetches only pending or contingent listings. Only applicable for for_sale listings from general area searches.",
-    )
     args = parser.parse_args()
     result = scrape_property(
@@ -76,7 +69,6 @@ def main():
         proxy=args.proxy,
         mls_only=args.mls_only,
         past_days=args.days,
-        pending_or_contingent=args.pending_or_contingent,
     )
     if not args.filename:
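The new `pending` choice can be exercised with a minimal argparse sketch mirroring the flag definition above (parser wiring simplified, not the full CLI):

```python
import argparse

parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
parser.add_argument("location", help="Location to scrape (e.g., San Francisco, CA)")
parser.add_argument(
    "-l",
    "--listing_type",
    type=str,
    default="for_sale",
    choices=["for_sale", "for_rent", "sold", "pending"],  # pending is new
    help="Listing type to scrape",
)

# parse a sample invocation instead of sys.argv
args = parser.parse_args(["San Francisco, CA", "-l", "pending"])
```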

View File

@@ -11,7 +11,8 @@ class ScraperInput:
     mls_only: bool | None = None
     proxy: str | None = None
     last_x_days: int | None = None
-    pending_or_contingent: bool | None = None
+    date_from: str | None = None
+    date_to: str | None = None
 class Scraper:
@@ -37,7 +38,8 @@ class Scraper:
         self.radius = scraper_input.radius
         self.last_x_days = scraper_input.last_x_days
         self.mls_only = scraper_input.mls_only
-        self.pending_or_contingent = scraper_input.pending_or_contingent
+        self.date_from = scraper_input.date_from
+        self.date_to = scraper_input.date_to
     def search(self) -> list[Property]:
         ...

View File

@@ -19,6 +19,7 @@ class SiteName(Enum):
 class ListingType(Enum):
     FOR_SALE = "FOR_SALE"
     FOR_RENT = "FOR_RENT"
+    PENDING = "PENDING"
     SOLD = "SOLD"
@@ -33,6 +34,8 @@ class Address:
 @dataclass
 class Description:
+    primary_photo: str | None = None
+    alt_photos: list[str] | None = None
     style: str | None = None
     beds: int | None = None
     baths_full: int | None = None
@@ -58,6 +61,7 @@ class Property:
     last_sold_date: str | None = None
     prc_sqft: int | None = None
     hoa_fee: int | None = None
+    days_on_mls: int | None = None
     description: Description | None = None
     latitude: float | None = None

View File

@@ -4,11 +4,11 @@ homeharvest.realtor.__init__
 This module implements the scraper for realtor.com
 """
+from datetime import datetime
 from typing import Dict, Union, Optional
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from .. import Scraper
-from ....exceptions import NoResultsFound
 from ..models import Property, Address, ListingType, Description
@@ -18,7 +18,6 @@ class RealtorScraper(Scraper):
     ADDRESS_AUTOCOMPLETE_URL = "https://parser-external.geo.moveaws.com/suggest"
     def __init__(self, scraper_input):
-        self.counter = 1
         super().__init__(scraper_input)
     def handle_location(self):
@@ -38,7 +37,7 @@ class RealtorScraper(Scraper):
         result = response_json["autocomplete"]
         if not result:
-            raise NoResultsFound("No results found for location: " + self.location)
+            return None
         return result[0]
@@ -50,6 +49,7 @@ class RealtorScraper(Scraper):
                     listing_id
                 }
                 address {
+                    street_direction
                     street_number
                     street_name
                     street_suffix
@@ -84,6 +84,12 @@ class RealtorScraper(Scraper):
                     garage
                     permalink
                 }
+                primary_photo {
+                    href
+                }
+                photos {
+                    href
+                }
             }
         }"""
@@ -110,6 +116,24 @@ class RealtorScraper(Scraper):
             and property_info["address"].get("location")
             and property_info["address"]["location"].get("coordinate")
         )
+        list_date_str = property_info["basic"]["list_date"].split("T")[0] if property_info["basic"].get(
+            "list_date") else None
+        last_sold_date_str = property_info["basic"]["sold_date"].split("T")[0] if property_info["basic"].get(
+            "sold_date") else None
+        list_date = datetime.strptime(list_date_str, "%Y-%m-%d") if list_date_str else None
+        last_sold_date = datetime.strptime(last_sold_date_str, "%Y-%m-%d") if last_sold_date_str else None
+        today = datetime.now()
+        days_on_mls = None
+        status = property_info["basic"]["status"].lower()
+        if list_date:
+            if status == "sold" and last_sold_date:
+                days_on_mls = (last_sold_date - list_date).days
+            elif status in ('for_sale', 'for_rent'):
+                days_on_mls = (today - list_date).days
+            if days_on_mls and days_on_mls < 0:
+                days_on_mls = None
         listing = Property(
             mls=mls,
@@ -119,17 +143,13 @@ class RealtorScraper(Scraper):
             property_url=f"{self.PROPERTY_URL}{property_info['details']['permalink']}",
             status=property_info["basic"]["status"].upper(),
             list_price=property_info["basic"]["price"],
-            list_date=property_info["basic"]["list_date"].split("T")[0]
-            if property_info["basic"].get("list_date")
-            else None,
+            list_date=list_date,
             prc_sqft=property_info["basic"].get("price")
             / property_info["basic"].get("sqft")
             if property_info["basic"].get("price")
             and property_info["basic"].get("sqft")
             else None,
-            last_sold_date=property_info["basic"]["sold_date"].split("T")[0]
-            if property_info["basic"].get("sold_date")
-            else None,
+            last_sold_date=last_sold_date,
             latitude=property_info["address"]["location"]["coordinate"].get("lat")
             if able_to_get_lat_long
             else None,
@@ -138,6 +158,8 @@ class RealtorScraper(Scraper):
             else None,
             address=self._parse_address(property_info, search_type="handle_listing"),
             description=Description(
+                primary_photo=property_info["primary_photo"].get("href", "").replace("s.jpg", "od-w480_h360_x2.webp?w=1080&q=75"),
+                alt_photos=self.process_alt_photos(property_info.get("photos", [])),
                 style=property_info["basic"].get("type", "").upper(),
                 beds=property_info["basic"].get("beds"),
                 baths_full=property_info["basic"].get("baths_full"),
@@ -149,6 +171,7 @@ class RealtorScraper(Scraper):
                 garage=property_info["details"].get("garage"),
                 stories=property_info["details"].get("stories"),
             ),
+            days_on_mls=days_on_mls
         )
         return [listing]
@@ -201,6 +224,7 @@ class RealtorScraper(Scraper):
                     stories
                 }
                 address {
+                    street_direction
                     street_number
                     street_name
                     street_suffix
@@ -231,6 +255,12 @@ class RealtorScraper(Scraper):
                     units
                     year_built
                 }
+                primary_photo {
+                    href
+                }
+                photos {
+                    href
+                }
             }
         }"""
@@ -274,6 +304,10 @@ class RealtorScraper(Scraper):
                 last_sold_date
                 list_price
                 price_per_sqft
+                flags {
+                    is_contingent
+                    is_pending
+                }
                 description {
                     sqft
                     beds
@@ -297,6 +331,7 @@ class RealtorScraper(Scraper):
                 }
                 location {
                     address {
+                        street_direction
                         street_number
                         street_name
                         street_suffix
@@ -313,19 +348,27 @@ class RealtorScraper(Scraper):
                             name
                         }
                     }
+                primary_photo {
+                    href
+                }
+                photos {
+                    href
+                }
                 }
             }
         }"""
-        date_param = (
-            'sold_date: { min: "$today-%sD" }' % self.last_x_days
-            if self.listing_type == ListingType.SOLD and self.last_x_days
-            else (
-                'list_date: { min: "$today-%sD" }' % self.last_x_days
-                if self.last_x_days
-                else ""
-            )
-        )
+        date_param = ""
+        if self.listing_type == ListingType.SOLD:
+            if self.date_from and self.date_to:
+                date_param = f'sold_date: {{ min: "{self.date_from}", max: "{self.date_to}" }}'
+            elif self.last_x_days:
+                date_param = f'sold_date: {{ min: "$today-{self.last_x_days}D" }}'
+        else:
+            if self.date_from and self.date_to:
+                date_param = f'list_date: {{ min: "{self.date_from}", max: "{self.date_to}" }}'
+            elif self.last_x_days:
+                date_param = f'list_date: {{ min: "$today-{self.last_x_days}D" }}'
         sort_param = (
             "sort: [{ field: sold_date, direction: desc }]"
@@ -335,17 +378,19 @@ class RealtorScraper(Scraper):
         pending_or_contingent_param = (
             "or_filters: { contingent: true, pending: true }"
-            if self.pending_or_contingent
+            if self.listing_type == ListingType.PENDING
             else ""
         )
+        listing_type = ListingType.FOR_SALE if self.listing_type == ListingType.PENDING else self.listing_type
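The rewritten `date_param` logic above prefers an explicit `date_from`/`date_to` range over the rolling `$today-ND` window, and picks the GraphQL field by listing type. Extracted here into a hypothetical standalone function for illustration (not part of the scraper's API):

```python
def build_date_param(listing_type: str, date_from=None, date_to=None, last_x_days=None) -> str:
    """Sketch of the scraper's date filter: sold searches filter on sold_date,
    everything else on list_date; explicit ranges win over last_x_days."""
    field = "sold_date" if listing_type == "sold" else "list_date"
    if date_from and date_to:
        return f'{field}: {{ min: "{date_from}", max: "{date_to}" }}'
    if last_x_days:
        return f'{field}: {{ min: "$today-{last_x_days}D" }}'
    return ""
```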
         if search_type == "comps":  #: comps search, came from an address
             query = """query Property_search(
                 $coordinates: [Float]!
                 $radius: String!
                 $offset: Int!,
             ) {
-                property_search(
+                home_search(
                     query: {
                         nearby: {
                             coordinates: $coordinates
@@ -353,13 +398,15 @@ class RealtorScraper(Scraper):
                         }
                         status: %s
                         %s
+                        %s
                     }
                     %s
                     limit: 200
                     offset: $offset
                 ) %s""" % (
-                self.listing_type.value.lower(),
+                listing_type.value.lower(),
                 date_param,
+                pending_or_contingent_param,
                 sort_param,
                 results_query,
             )
@@ -385,7 +432,7 @@ class RealtorScraper(Scraper):
                 limit: 200
                 offset: $offset
             ) %s""" % (
-                self.listing_type.value.lower(),
+                listing_type.value.lower(),
                 date_param,
                 pending_or_contingent_param,
                 sort_param,
@@ -415,7 +462,7 @@ class RealtorScraper(Scraper):
         response = self.session.post(self.SEARCH_GQL_URL, json=payload)
         response.raise_for_status()
         response_json = response.json()
-        search_key = "home_search" if search_type == "area" else "property_search"
+        search_key = "home_search" if "home_search" in query else "property_search"
         properties: list[Property] = []
@@ -430,7 +477,6 @@ class RealtorScraper(Scraper):
             return {"total": 0, "properties": []}
         for result in response_json["data"][search_key]["results"]:
-            self.counter += 1
             mls = (
                 result["source"].get("id")
                 if "source" in result and isinstance(result["source"], dict)
@@ -447,13 +493,18 @@ class RealtorScraper(Scraper):
                 and result["location"]["address"].get("coordinate")
             )
+            is_pending = result["flags"].get("is_pending") or result["flags"].get("is_contingent")
+            if is_pending and self.listing_type != ListingType.PENDING:
+                continue
             realty_property = Property(
                 mls=mls,
                 mls_id=result["source"].get("listing_id")
                 if "source" in result and isinstance(result["source"], dict)
                 else None,
                 property_url=f"{self.PROPERTY_URL}{result['property_id']}",
-                status=result["status"].upper(),
+                status="PENDING" if is_pending else result["status"].upper(),
                 list_price=result["list_price"],
                 list_date=result["list_date"].split("T")[0]
                 if result.get("list_date")
@@ -470,8 +521,8 @@ class RealtorScraper(Scraper):
                 if able_to_get_lat_long
                 else None,
                 address=self._parse_address(result, search_type="general_search"),
-                #: neighborhoods=self._parse_neighborhoods(result),
                 description=self._parse_description(result),
+                days_on_mls=self.calculate_days_on_mls(result)
             )
             properties.append(realty_property)
@@ -482,6 +533,9 @@ class RealtorScraper(Scraper):
     def search(self):
         location_info = self.handle_location()
+        if not location_info:
+            return []
         location_type = location_info["area_type"]
         search_variables = {
@@ -560,36 +614,53 @@ class RealtorScraper(Scraper):
         return ", ".join(neighborhoods_list) if neighborhoods_list else None
     @staticmethod
-    def _parse_address(result: dict, search_type):
+    def handle_none_safely(address_part):
+        if address_part is None:
+            return ""
+        return address_part
+    def _parse_address(self, result: dict, search_type):
         if search_type == "general_search":
-            return Address(
-                street=f"{result['location']['address']['street_number']} {result['location']['address']['street_name']} {result['location']['address']['street_suffix']}",
-                unit=result["location"]["address"]["unit"],
-                city=result["location"]["address"]["city"],
-                state=result["location"]["address"]["state_code"],
-                zip=result["location"]["address"]["postal_code"],
-            )
-        return Address(
-            street=f"{result['address']['street_number']} {result['address']['street_name']} {result['address']['street_suffix']}",
-            unit=result["address"]["unit"],
-            city=result["address"]["city"],
-            state=result["address"]["state_code"],
-            zip=result["address"]["postal_code"],
-        )
+            address = result['location']['address']
+        else:
+            address = result["address"]
+        return Address(
+            street=" ".join([
+                self.handle_none_safely(address.get('street_number')),
+                self.handle_none_safely(address.get('street_direction')),
+                self.handle_none_safely(address.get('street_name')),
+                self.handle_none_safely(address.get('street_suffix')),
+            ]).strip(),
+            unit=address["unit"],
+            city=address["city"],
+            state=address["state_code"],
+            zip=address["postal_code"],
+        )
     @staticmethod
     def _parse_description(result: dict) -> Description:
         description_data = result.get("description", {})
         if description_data is None or not isinstance(description_data, dict):
+            print("Warning: description_data is invalid!")
             description_data = {}
         style = description_data.get("type", "")
         if style is not None:
             style = style.upper()
+        primary_photo = ""
+        if result and "primary_photo" in result:
+            primary_photo_info = result["primary_photo"]
+            if primary_photo_info and "href" in primary_photo_info:
+                primary_photo_href = primary_photo_info["href"]
+                primary_photo = primary_photo_href.replace("s.jpg", "od-w480_h360_x2.webp?w=1080&q=75")
         return Description(
+            primary_photo=primary_photo,
+            alt_photos=RealtorScraper.process_alt_photos(result.get("photos")),
             style=style,
             beds=description_data.get("beds"),
             baths_full=description_data.get("baths_full"),
@@ -601,3 +672,36 @@ class RealtorScraper(Scraper):
             garage=description_data.get("garage"),
             stories=description_data.get("stories"),
         )
+    @staticmethod
+    def calculate_days_on_mls(result: dict) -> Optional[int]:
+        list_date_str = result.get("list_date")
+        list_date = datetime.strptime(list_date_str.split("T")[0], "%Y-%m-%d") if list_date_str else None
+        last_sold_date_str = result.get("last_sold_date")
+        last_sold_date = datetime.strptime(last_sold_date_str, "%Y-%m-%d") if last_sold_date_str else None
+        today = datetime.now()
+        if list_date:
+            if result["status"] == 'sold':
+                if last_sold_date:
+                    days = (last_sold_date - list_date).days
+                    if days >= 0:
+                        return days
+            elif result["status"] in ('for_sale', 'for_rent'):
+                days = (today - list_date).days
+                if days >= 0:
+                    return days
+    @staticmethod
+    def process_alt_photos(photos_info):
+        try:
+            alt_photos = []
+            if photos_info:
+                for photo_info in photos_info:
+                    href = photo_info.get("href", "")
+                    alt_photo_href = href.replace("s.jpg", "od-w480_h360_x2.webp?w=1080&q=75")
+                    alt_photos.append(alt_photo_href)
+            return alt_photos
+        except Exception:
+            pass
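The photo handling in this file rewrites each `s.jpg` thumbnail suffix into a larger webp rendition. That suffix swap can be sketched on its own (`upgrade_photo_url` is a hypothetical helper and the sample URL is illustrative; only the replacement string comes from the diff):

```python
def upgrade_photo_url(href: str) -> str:
    # swap the thumbnail suffix for the larger webp variant, as the
    # scraper does for primary_photo and alt_photos
    return href.replace("s.jpg", "od-w480_h360_x2.webp?w=1080&q=75")

# entries without an href fall back to "" rather than raising
photos = [{"href": "https://example.com/abc123s.jpg"}, {}]
alt_photos = [upgrade_photo_url(p.get("href", "")) for p in photos]
```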

View File

@@ -1,6 +1,5 @@
 class InvalidListingType(Exception):
     """Raised when a provided listing type is does not exist."""
-class NoResultsFound(Exception):
-    """Raised when no results are found for the given location"""
+class InvalidDate(Exception):
+    """Raised when only one of date_from or date_to is provided or not in the correct format. ex: 2023-10-23 """

View File

@@ -1,9 +1,12 @@
import pandas as pd
from datetime import datetime

from .core.scrapers.models import Property, ListingType
from .exceptions import InvalidListingType, InvalidDate
ordered_properties = [
    "property_url",
    "primary_photo",
    "alt_photos",
    "mls",
    "mls_id",
    "status",
@@ -18,6 +21,7 @@ ordered_properties = [
    "half_baths",
    "sqft",
    "year_built",
    "days_on_mls",
    "list_price",
    "list_date",
    "sold_price",
@@ -47,6 +51,8 @@ def process_result(result: Property) -> pd.DataFrame:
    prop_data["price_per_sqft"] = prop_data["prc_sqft"]

    description = result.description
    prop_data["primary_photo"] = description.primary_photo
    prop_data["alt_photos"] = ", ".join(description.alt_photos)
    prop_data["style"] = description.style
    prop_data["beds"] = description.beds
    prop_data["full_baths"] = description.baths_full
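After the per-field assignments, `process_result` emits the row with columns in `ordered_properties` order. The pattern is standard pandas column selection; a toy sketch with an abbreviated, made-up column list:

```python
import pandas as pd

# Abbreviated stand-in for the full ordered_properties list above:
ordered = ["property_url", "primary_photo", "status", "list_price"]

row = {
    "list_price": 450_000,
    "status": "for_sale",
    "property_url": "https://www.realtor.com/realestateandhomes-detail/example",
    "primary_photo": "https://example.com/photo.jpg",
}
# Indexing a DataFrame with a list of column names both filters and reorders
# the columns, so every output row follows the same schema.
df = pd.DataFrame([row])[ordered]
print(list(df.columns))  # ['property_url', 'primary_photo', 'status', 'list_price']
```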
@@ -69,3 +75,18 @@ def validate_input(listing_type: str) -> None:
        raise InvalidListingType(
            f"Provided listing type, '{listing_type}', does not exist."
        )
def validate_dates(date_from: str | None, date_to: str | None) -> None:
    if (date_from is not None and date_to is None) or (date_from is None and date_to is not None):
        raise InvalidDate("Both date_from and date_to must be provided.")

    if date_from and date_to:
        try:
            date_from_obj = datetime.strptime(date_from, "%Y-%m-%d")
            date_to_obj = datetime.strptime(date_to, "%Y-%m-%d")

            if date_to_obj < date_from_obj:
                raise InvalidDate("date_to must be after date_from.")
        except ValueError as e:
            raise InvalidDate("Invalid date format; expected YYYY-MM-DD.") from e
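A quick sanity check of the validation rules: both-or-neither, `YYYY-MM-DD` format, and `date_from <= date_to`. The function and exception are re-declared locally here so the sketch runs standalone (the error messages are paraphrased, not the library's exact strings):

```python
from datetime import datetime


class InvalidDate(Exception):
    """Local stand-in for homeharvest.exceptions.InvalidDate."""


def validate_dates(date_from, date_to):
    # Both-or-neither: exactly one date provided is an error.
    if (date_from is None) != (date_to is None):
        raise InvalidDate("Both date_from and date_to must be provided.")
    if date_from and date_to:
        try:
            start = datetime.strptime(date_from, "%Y-%m-%d")
            end = datetime.strptime(date_to, "%Y-%m-%d")
        except ValueError as e:
            raise InvalidDate("Invalid date format; expected YYYY-MM-DD.") from e
        if end < start:
            raise InvalidDate("date_to must be after date_from.")


validate_dates("2023-05-01", "2023-05-28")  # valid range, no exception
try:
    validate_dates("2023-05-01", None)      # only one bound supplied
except InvalidDate as e:
    print(e)  # Both date_from and date_to must be provided.
```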


@@ -1,9 +1,9 @@
[tool.poetry]
name = "homeharvest"
version = "0.3.10"
description = "Real estate scraping library supporting Zillow, Realtor.com & Redfin."
authors = ["Zachary Hampton <zachary@zacharysproducts.com>", "Cullen Watson <cullen@cullen.ai>"]
homepage = "https://github.com/Bunsly/HomeHarvest"
readme = "README.md"

[tool.poetry.scripts]


@@ -1,20 +1,15 @@
from homeharvest import scrape_property
from homeharvest.exceptions import (
    InvalidListingType,
)
def test_realtor_pending_or_contingent():
    pending_or_contingent_result = scrape_property(
        location="Surprise, AZ", listing_type="pending"
    )

    regular_result = scrape_property(location="Surprise, AZ", listing_type="for_sale")

    assert all(
        [
@@ -25,6 +20,45 @@ def test_realtor_pending_or_contingent():
    assert len(pending_or_contingent_result) != len(regular_result)
def test_realtor_pending_comps():
    pending_comps = scrape_property(
        location="2530 Al Lipscomb Way",
        radius=5,
        past_days=180,
        listing_type="pending",
    )

    for_sale_comps = scrape_property(
        location="2530 Al Lipscomb Way",
        radius=5,
        past_days=180,
        listing_type="for_sale",
    )

    sold_comps = scrape_property(
        location="2530 Al Lipscomb Way",
        radius=5,
        past_days=180,
        listing_type="sold",
    )

    results = [pending_comps, for_sale_comps, sold_comps]
    assert all([result is not None for result in results])

    #: assert all lengths are different
    assert len(set([len(result) for result in results])) == len(results)
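The final assertion relies on a set of lengths: it holds only when every result set has a distinct row count, which the test takes as evidence that the three `listing_type` queries hit different listing pools. A toy illustration with stand-in lists:

```python
# Three stand-in result sets, all with distinct sizes:
results = [[1, 2, 3], [1, 2], [1]]
assert len(set(len(r) for r in results)) == len(results)

# If any two sets share a length, the set of lengths collapses:
tied = [[1, 2], [3, 4], [5]]
assert len(set(len(r) for r in tied)) < len(tied)
print("distinct-length assertion demonstrated")
```

This makes the check somewhat brittle: two genuinely different result sets that happen to contain the same number of rows would fail it.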
def test_realtor_sold_past():
    result = scrape_property(
        location="San Diego, CA",
        past_days=30,
        listing_type="sold",
    )

    assert result is not None and len(result) > 0
def test_realtor_comps():
    result = scrape_property(
        location="2530 Al Lipscomb Way",
@@ -50,6 +84,20 @@ def test_realtor_last_x_days_sold():
    ) and len(days_result_30) != len(days_result_10)
def test_realtor_date_range_sold():
    days_result_30 = scrape_property(
        location="Dallas, TX", listing_type="sold", date_from="2023-05-01", date_to="2023-05-28"
    )

    days_result_60 = scrape_property(
        location="Dallas, TX", listing_type="sold", date_from="2023-04-01", date_to="2023-06-10"
    )

    assert all(
        [result is not None for result in [days_result_30, days_result_60]]
    ) and len(days_result_30) < len(days_result_60)
def test_realtor_single_property():
    results = [
        scrape_property(
@@ -82,15 +130,12 @@ def test_realtor():
    assert all([result is not None for result in results])
def test_realtor_bad_address():
    bad_results = scrape_property(
        location="abceefg ju098ot498hh9",
        listing_type="for_sale",
    )
    assert len(bad_results) == 0