Compare commits

...

159 Commits

Author SHA1 Message Date
Zachary Hampton
2d75ca4dfa Merge pull request #131 from ZacharyHampton/feature/data-additions
Feature/data additions
2025-07-15 13:56:16 -07:00
Zachary Hampton
ca1be85a93 - delete test 2025-07-15 13:55:40 -07:00
Zachary Hampton
145c337b55 - data quality and clean up code 2025-07-15 13:51:47 -07:00
Zachary Hampton
6c6243eba4 - add all new data fields 2025-07-15 13:21:48 -07:00
Zachary Hampton
79082090cb - pydantic conversion 2025-07-15 12:25:43 -07:00
Zachary Hampton
8311f4dfbc - data additions 2025-07-15 12:00:19 -07:00
Zachary Hampton
0d85100091 - update dependencies 2025-07-14 17:08:27 -07:00
Zachary Hampton
851ba53d81 Merge pull request #128 from Alexandre-Shofstall/fix/python39-compat
Fix syntax of __init__ line 24
2025-07-03 10:28:49 -07:00
Zachary Hampton
0fdc309262 Update pyproject.toml 2025-07-03 10:28:14 -07:00
Alexandre Shofstall
62b6726d42 Fix syntax of __init__ line 24 2025-07-03 19:20:49 +02:00
Zachary Hampton
ccf5786ce2 Merge pull request #127 from Alexandre-Shofstall/fix/python39-compat
Fix typing syntax for Python 3.9 compatibility in __init__.py
2025-07-03 09:43:26 -07:00
Zachary Hampton
b4f05b254a Update pyproject.toml 2025-07-03 09:43:10 -07:00
Alexandre Shofstall
941d1081f7 Fix typing syntax for Python 3.9 compatibility in __init__.py 2025-07-03 18:11:18 +02:00
Zachary Hampton
c788b3318d Update README.md 2025-06-19 16:52:14 -07:00
zachary
68a3438c6e - single home return type bug fix 2025-05-05 12:29:36 -07:00
zachary
a3c5e9060e - updated queries 2025-05-03 13:55:56 -07:00
zachary
d06595fe56 - updated queries 2025-05-03 13:28:12 -07:00
zachary
e378feeefe - bug fixes 2025-04-12 18:34:35 -07:00
zachary
8a5683fe79 - return type parameter
- optimized get extra fields with query clustering
2025-04-12 17:55:52 -07:00
Zachary Hampton
65f799a27d Update README.md 2025-02-21 13:33:32 -07:00
Cullen Watson
0de916e590 enh:tax history 2025-01-06 05:28:36 -06:00
Cullen Watson
6a3f7df087 chore:yml 2024-11-05 23:55:59 -06:00
Cullen Watson
a75bcc2aa0 docs:readme 2024-11-04 10:22:32 -06:00
Cullen Watson
1082b86fa1 docs:readme 2024-11-03 17:23:58 -06:00
Cullen Watson
8e04f6b117 enh: property type (#102) 2024-11-03 17:23:07 -06:00
Zachary Hampton
1f717bd9e3 - switch eps
- new hrefs
- property_id, listing_id data points
2024-09-06 15:49:07 -07:00
Zachary Hampton
8cfe056f79 - office mls set 2024-08-23 10:54:43 -07:00
Zachary Hampton
1010c743b6 - agent mls set and nrds id 2024-08-23 10:47:45 -07:00
Zachary Hampton
32fdc281e3 - rewrote & optimized flow
- new_construction data point
- renamed "agent" & "broker" to "agent_name" & "broker_name"
- added builder & office data
- added entity uuids
2024-08-20 05:19:15 -07:00
Zachary Hampton
6d14b8df5a - fix limit parameter
- fix specific for_rent apartment listing prices
2024-08-13 10:44:11 -07:00
Zachary Hampton
3f44744d61 - primary photo bug fix
- limit parameter
2024-07-15 07:19:57 -07:00
Zachary Hampton
ac0cad62a7 - optimizations 2024-06-14 21:50:23 -07:00
Cullen Watson
beb885cc8d fix: govt type (#82) 2024-06-12 17:34:34 -05:00
Zachary Hampton
011680f7d8 - style error bug fix 2024-06-06 15:24:12 -07:00
Zachary Hampton
93e6778a48 - exclude_pending parameter 2024-05-31 22:17:29 -07:00
Zachary Hampton
ec036bb989 - optimizations & updated realtor headers 2024-05-20 12:13:30 -07:00
Zachary Hampton
aacd168545 - alt photos bug fix 2024-05-18 17:47:55 -07:00
Zachary Hampton
0d70007000 - alt photos bug fix 2024-05-16 23:04:07 -07:00
Zachary Hampton
018d3fbac4 - Python 3.9 support (tested) (could potentially work for lower versions, but I have not validated such) 2024-05-14 19:13:04 -07:00
Zachary Hampton
803fd618e9 - data cleaning & CONDOP bug fixes 2024-05-12 21:12:12 -07:00
Zachary Hampton
b23b55ca80 - full street line (data quality improvement) 2024-05-12 18:49:44 -07:00
Zachary Hampton
3458a08383 - broker data 2024-05-11 21:35:29 -07:00
Zachary Hampton
c3e24a4ce0 - extra_property_details parameter
- updated docs
- classified exception
2024-05-02 09:04:49 -07:00
Zachary Hampton
46985dcee4 - various data quality fixes (including #70) 2024-05-02 08:48:53 -07:00
Cullen Watson
04ae968716 enh: assessed/estimated value (#77) 2024-04-30 15:29:54 -05:00
Cullen
c5b15e9be5 chore: version 2024-04-20 17:45:29 -05:00
joecryptotoo
7a525caeb8 added county, fips, and text desciption fields (#72) 2024-04-20 17:44:28 -05:00
Cullen Watson
7246703999 Schools (#69) 2024-04-16 20:01:20 -05:00
Cullen Watson
6076b0f961 enh: add agent (#68) 2024-04-16 15:09:32 -05:00
Cullen Watson
cdc6f2a2a8 docs: readme 2024-04-16 14:59:50 -05:00
Cullen Watson
0bdf56568e enh: add agent name/phone (#66) 2024-04-16 14:55:44 -05:00
Cullen Watson
1f47fc3b7e fix: use enum value (#65) 2024-04-12 01:41:15 -05:00
Zachary Hampton
5c2498c62b - pending date, property type fields (#45)
- alt photos bug fix (#57)
2024-03-13 19:17:17 -07:00
Zachary Hampton
d775540afd - location bug fix 2024-03-06 16:31:06 -07:00
Cullen Watson
5ea9a6f6b6 docs: readme 2024-03-03 11:49:27 -06:00
robertomr100
ab6a0e3b6e Add foreclosure parameter (#55) 2024-03-03 11:45:28 -06:00
Zachary Hampton
03198428de Merge pull request #48 from Bunsly/for_rent_url
fix: rent url
2024-01-09 13:12:30 -07:00
Cullen Watson
70fa071318 fix: rent url 2024-01-08 12:46:31 -06:00
Cullen Watson
f7e74cf535 Merge pull request #44 from Bunsly/fix_postal_search
fix postal search to search just by zip
2023-12-02 00:40:13 -06:00
Cullen Watson
e17b976923 fix postal search to search just by zip 2023-12-02 00:39:28 -06:00
Zachary Hampton
ad13b55ea6 Update README.md 2023-11-30 11:48:48 -07:00
Cullen Watson
19f23c95c4 Merge pull request #43 from Bunsly/add_photos
Add photos
2023-11-24 21:40:34 -06:00
Cullen
4676ec9839 chore: remove test file 2023-11-24 13:42:52 -06:00
Cullen
6dd0b058d3 chore: version 2023-11-24 13:41:46 -06:00
Cullen
a74c1a9950 enh: add photos 2023-11-24 13:40:57 -06:00
Cullen Watson
fa507dbc72 docs: typo 2023-11-20 01:05:10 -06:00
Cullen Watson
5b6a9943cc Merge pull request #42 from Bunsly/street_dirction
fix: add street direction
2023-11-08 16:53:29 -06:00
Cullen Watson
9816defaf3 chore: version 2023-11-08 16:53:05 -06:00
Cullen Watson
f692b438b2 fix: add street direction 2023-11-08 16:52:06 -06:00
Zachary Hampton
30f48f54c8 Update README.md 2023-11-06 22:13:01 -07:00
Cullen Watson
7f86f69610 docs: readme 2023-11-03 18:53:46 -05:00
Cullen Watson
cc64dacdb0 docs: readme - date_from, date_to 2023-11-03 18:52:22 -05:00
Cullen Watson
d3268d8e5a Merge pull request #40 from Bunsly/date_range
Add date_to and date_from params
2023-11-03 18:42:13 -05:00
Cullen Watson
4edad901c5 [enh] date_to and date_from 2023-11-03 18:40:34 -05:00
Zachary Hampton
c597a78191 - None address bug fix 2023-10-18 16:32:43 -07:00
Zachary Hampton
11a7d854f0 - remove pending listings from for_sale 2023-10-18 14:41:41 -07:00
Zachary Hampton
f726548cc6 Update pyproject.toml 2023-10-18 09:35:48 -07:00
Zachary Hampton
fad7d670eb Update README.md 2023-10-18 08:37:42 -07:00
Zachary Hampton
89a6f93c9f Update pyproject.toml 2023-10-18 08:37:26 -07:00
Zachary Hampton
e1090b06e4 Update README.md 2023-10-17 20:22:25 -07:00
Cullen Watson
5036e74b60 Merge branch 'master' of https://github.com/ZacharyHampton/HomeHarvest 2023-10-09 11:30:17 -05:00
Cullen Watson
2cb544bc8d [chore] display clickable URLs in jupyter 2023-10-09 11:28:56 -05:00
Zachary Hampton
68cb365e03 Merge pull request #34 from ZacharyHampton/days_on_mls
[enh] days_on_mls attr
2023-10-09 09:04:59 -07:00
Cullen Watson
23876d5725 [chore] function types 2023-10-09 11:02:51 -05:00
Cullen Watson
b59d55f6b5 [enh] days_on_mls attr 2023-10-09 11:00:36 -05:00
Cullen Watson
3c3adb5f29 [docs] update video 2023-10-05 20:24:23 -05:00
Zachary Hampton
6ede8622cc - pending listing support
- removal of pending_or_contingent param
2023-10-05 11:43:00 -07:00
Cullen Watson
9f50d33bdb [chore] remove unused dependency 2023-10-05 10:11:45 -05:00
Cullen Watson
735ec021f7 [docs] README 2023-10-05 10:03:21 -05:00
Zachary Hampton
00537329cf - version bump 2023-10-04 21:35:21 -07:00
Zachary Hampton
a9225b532f - rename days variable 2023-10-04 21:35:14 -07:00
Zachary Hampton
ba7ad069c9 Merge pull request #32 from ZacharyHampton/key_error
[fix] keyerror on style
2023-10-04 20:35:05 -07:00
Cullen Watson
22bda972b0 [chore] version number 2023-10-04 22:34:52 -05:00
Cullen Watson
6f5bbf79a4 [fix] keyerror on style 2023-10-04 22:33:21 -05:00
Cullen Watson
608cceba34 [docs] reorder 2023-10-04 22:12:16 -05:00
Cullen Watson
3609586995 [docs]: add contingent to example 2023-10-04 22:11:38 -05:00
Cullen Watson
68c7e411e4 [docs] pending / contingent searches 2023-10-04 22:07:51 -05:00
Cullen Watson
5e825601a7 [docs] update example 2023-10-04 21:50:54 -05:00
Cullen Watson
ce3f94d0af [docs] update example 2023-10-04 21:50:16 -05:00
Zachary Hampton
4a1116440d Merge pull request #31 from ZacharyHampton/v0.3
v0.3
2023-10-04 19:26:44 -07:00
Cullen Watson
2d092c595f [docs]: Update README.md 2023-10-04 21:24:24 -05:00
Cullen Watson
4dbb064fe9 [docs]: Update README.md 2023-10-04 21:21:45 -05:00
Cullen Watson
4e78248032 Update README.md 2023-10-04 21:17:49 -05:00
Zachary Hampton
37e20f4469 - remove neighborhoods
- rename data
2023-10-04 18:44:47 -07:00
Zachary Hampton
8a5f0dc2c9 - pending or contingent support 2023-10-04 18:25:01 -07:00
Zachary Hampton
de692faae2 - rename last_x_days
- docstrings for scrape_property
2023-10-04 18:06:06 -07:00
Zachary Hampton
6bb68766fc - realtor tests 2023-10-04 12:04:05 -07:00
Zachary Hampton
446d5488b8 - single address support again 2023-10-04 10:07:32 -07:00
Cullen Watson
68e15ce696 [docs] clarify example 2023-10-04 10:14:11 -05:00
Cullen Watson
c4870677c2 [enh]: make last_x_days generic
add mls_only
make radius generic
2023-10-04 10:11:53 -05:00
Cullen Watson
51bde20c3c [chore]: clean up 2023-10-04 08:58:55 -05:00
Zachary Hampton
f8c0dd766d - realtor support 2023-10-03 23:33:53 -07:00
Zachary Hampton
f06a01678c - cli readme update 2023-10-03 22:31:23 -07:00
Zachary Hampton
d2879734e6 - cli update 2023-10-03 22:25:29 -07:00
Zachary Hampton
bf81ef413f - version bump 2023-10-03 22:22:09 -07:00
Zachary Hampton
29664e4eee - cullen merge 2023-10-03 22:21:16 -07:00
Zachary Hampton
088088ae51 - last x days param 2023-10-03 15:05:17 -07:00
Zachary Hampton
40bbf76db1 - realtor radius 2023-10-02 13:58:47 -07:00
Zachary Hampton
1f1ca8068f - realtor.com default 2023-10-02 10:28:13 -07:00
Zachary Hampton
8388d47f73 - version bump 2023-10-01 09:13:37 -07:00
Zachary Hampton
ba503b0ca3 Merge pull request #27 from ddxv/zillow-ua-header
Zillow Request Header: Match observed behaivor in FireFox of not sending sec-ch-ua headers
2023-10-01 09:12:58 -07:00
james
8962d619e1 Match observed behaivor in FireFox of not sending ua-ch headers in request to prevent recent 403 2023-10-01 11:31:51 +08:00
Zachary Hampton
3b7c17b7b5 - zillow proxy support 2023-09-28 18:40:16 -07:00
Zachary Hampton
59317fd6fc Merge pull request #25 from ZacharyHampton/fix/recent-issues
Fix/recent issues
2023-09-28 18:27:04 -07:00
Zachary Hampton
928b431d1f - bump version 2023-09-28 18:25:53 -07:00
Zachary Hampton
896f862137 - zillow flow update 2023-09-28 18:25:47 -07:00
Zachary Hampton
3174f5076c Merge pull request #23 from ZacharyHampton/fix/recent-issues
Fixes & Changes for recent issues
2023-09-28 18:07:55 -07:00
Zachary Hampton
2abbb913a8 - convert posted_time to datetime
- zillow location bug fix
2023-09-28 18:07:42 -07:00
Cullen Watson
73b6d5b33f [fix] zilow tls client 2023-09-28 19:34:01 -05:00
Zachary Hampton
da39c989d9 - version bump 2023-09-28 15:27:36 -07:00
Zachary Hampton
01c53f9399 - redfin bug fix
- add recent features for issues
2023-09-28 15:19:43 -07:00
Zachary Hampton
9200c17df2 - version bump 2023-09-23 10:55:50 -07:00
Zachary Hampton
9e262bf214 Merge remote-tracking branch 'origin/master' 2023-09-23 10:55:29 -07:00
Zachary Hampton
82f78fb578 - zillow bug fix 2023-09-23 10:55:14 -07:00
Cullen Watson
b0e40df00a Update pyproject.toml 2023-09-22 09:51:24 -05:00
Cullen Watson
2fc40e0dad fix: cookie 2023-09-22 09:47:37 -05:00
Zachary Hampton
254f3a68a1 - redfin bug fix 2023-09-21 18:54:03 -07:00
Zachary Hampton
05713c76b0 - redfin bug fix
- .get
2023-09-21 11:27:12 -07:00
Cullen Watson
9120cc9bfe fix: remove line 2023-09-21 13:10:14 -05:00
Cullen Watson
eee4b19515 Merge branch 'master' of https://github.com/ZacharyHampton/HomeHarvest 2023-09-21 13:06:15 -05:00
Cullen Watson
c25961eded fix: KeyEror : [minBaths] 2023-09-21 13:06:06 -05:00
Zachary Hampton
0884c3d163 Update README.md 2023-09-21 09:55:29 -07:00
Cullen Watson
8f37bfdeb8 chore: version number 2023-09-21 11:19:23 -05:00
Cullen Watson
48c2338276 fix: keyerror 2023-09-21 11:18:37 -05:00
Cullen Watson
f58a1f4a74 docs: tryhomeharvest.com 2023-09-21 10:57:11 -05:00
Zachary Hampton
4cef926d7d Merge pull request #14 from ZacharyHampton/keep_duplicates_flag
Keep duplicates flag
2023-09-20 20:27:08 -07:00
Cullen Watson
e82eeaa59f docs: add keep duplicates flag 2023-09-20 20:25:50 -05:00
Cullen Watson
644f16b25b feat: keep duplicates flag 2023-09-20 20:24:18 -05:00
Cullen Watson
e9ddc6df92 docs: update tutorial vid for release v0.2.7 2023-09-19 22:18:49 -05:00
Cullen Watson
50fb1c391d docs: update property schema 2023-09-19 21:35:37 -05:00
Cullen Watson
4f91f9dadb chore: version number 2023-09-19 21:17:12 -05:00
Zachary Hampton
66e55173b1 Merge pull request #13 from ZacharyHampton/simplify_fields
fix: simplify fields
2023-09-19 19:16:18 -07:00
Cullen Watson
f6054e8746 fix: simplify fields 2023-09-19 21:13:20 -05:00
Cullen Watson
e8d9235ee6 chore: update version number 2023-09-19 16:43:59 -05:00
Cullen Watson
043f091158 fix: keyerror on address 2023-09-19 16:43:17 -05:00
Cullen Watson
eae8108978 docs: change cmd 2023-09-19 16:18:01 -05:00
Zachary Hampton
0a39357a07 Merge pull request #12 from ZacharyHampton/proxy_bug
fix: proxy add to session correctly
2023-09-19 14:07:25 -07:00
Cullen Watson
8f06d46ddb chore: version number 2023-09-19 16:07:06 -05:00
Cullen Watson
0dae14ccfc fix: proxy add to session correctly 2023-09-19 16:05:14 -05:00
25 changed files with 52132 additions and 1743 deletions

1
.github/FUNDING.yml vendored Normal file
View File

@@ -0,0 +1 @@
github: Bunsly

21
.pre-commit-config.yaml Normal file
View File

@@ -0,0 +1,21 @@
---
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.2.0
hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-added-large-files
- id: check-yaml
- repo: https://github.com/adrienverge/yamllint
rev: v1.29.0
hooks:
- id: yamllint
verbose: true # create awareness of linter findings
args: ["-d", "{extends: relaxed, rules: {line-length: {max: 120}}}"]
- repo: https://github.com/psf/black
rev: 24.2.0
hooks:
- id: black
language_version: python
args: [--line-length=120, --quiet]

View File

@@ -1,118 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "cb48903e-5021-49fe-9688-45cd0bc05d0f",
"metadata": {},
"outputs": [],
"source": [
"from homeharvest import scrape_property\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "156488ce-0d5f-43c5-87f4-c33e9c427860",
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.max_columns', None) # Show all columns\n",
"pd.set_option('display.max_rows', None) # Show all rows\n",
"pd.set_option('display.width', None) # Auto-adjust display width to fit console\n",
"pd.set_option('display.max_colwidth', 50) # Limit max column width to 50 characters"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1c8b9744-8606-4e9b-8add-b90371a249a7",
"metadata": {},
"outputs": [],
"source": [
"# scrapes all 3 sites by default\n",
"scrape_property(\n",
" location=\"dallas\",\n",
" listing_type=\"for_sale\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aaf86093",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# search a specific address\n",
"scrape_property(\n",
" location=\"2530 Al Lipscomb Way\",\n",
" site_name=\"zillow\",\n",
" listing_type=\"for_sale\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab7b4c21-da1d-4713-9df4-d7425d8ce21e",
"metadata": {},
"outputs": [],
"source": [
"# check rentals\n",
"scrape_property(\n",
" location=\"chicago, illinois\",\n",
" site_name=[\"redfin\", \"zillow\"],\n",
" listing_type=\"for_rent\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af280cd3",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# check sold properties\n",
"scrape_property(\n",
" location=\"90210\",\n",
" site_name=[\"redfin\"],\n",
" listing_type=\"sold\"\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

269
README.md
View File

@@ -1,165 +1,196 @@
<img src="https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/d1a2bf8b-09f5-4c57-b33a-0ada8a34f12d" width="400"> <img src="https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/d1a2bf8b-09f5-4c57-b33a-0ada8a34f12d" width="400">
**HomeHarvest** is a simple, yet comprehensive, real estate scraping library. **HomeHarvest** is a real estate scraping library that extracts and formats data in the style of MLS listings.
[![Try with Replit](https://replit.com/badge?caption=Try%20with%20Replit)](https://replit.com/@ZacharyHampton/HomeHarvestDemo) ## HomeHarvest Features
*Looking to build a data-focused software product?* **[Book a call](https://calendly.com/zachary-products/15min)** *to work with us.* - **Source**: Fetches properties directly from **Realtor.com**.
## Features - **Data Format**: Structures data to resemble MLS listings.
- **Export Flexibility**: Options to save as either CSV or Excel.
- Scrapes properties from **Zillow**, **Realtor.com** & **Redfin** simultaneously
- Aggregates the properties in a Pandas DataFrame
[Video Guide for HomeHarvest](https://www.youtube.com/watch?v=HCoHoiJdWQY)
![homeharvest](https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/b3d5d727-e67b-4a9f-85d8-1e65fd18620a) ![homeharvest](https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/b3d5d727-e67b-4a9f-85d8-1e65fd18620a)
## Installation ## Installation
```bash ```bash
pip install --force-reinstall homeharvest pip install -U homeharvest
``` ```
_Python version >= [3.10](https://www.python.org/downloads/release/python-3100/) required_ _Python version >= [3.9](https://www.python.org/downloads/release/python-3100/) required_
## Usage ## Usage
### CLI
```bash
homeharvest "San Francisco, CA" -s zillow realtor.com redfin -l for_rent -o excel -f HomeHarvest
```
This will scrape properties from the specified sites for the given location and listing type, and save the results to an Excel file named `HomeHarvest.xlsx`.
By default:
- If `-s` or `--site_name` is not provided, it will scrape from all available sites.
- If `-l` or `--listing_type` is left blank, the default is `for_sale`. Other options are `for_rent` or `sold`.
- The `-o` or `--output` default format is `excel`. Options are `csv` or `excel`.
- If `-f` or `--filename` is left blank, the default is `HomeHarvest_<current_timestamp>`.
- If `-p` or `--proxy` is not provided, the scraper uses the local IP.
### Python ### Python
```py ```py
from homeharvest import scrape_property from homeharvest import scrape_property
import pandas as pd from datetime import datetime
properties: pd.DataFrame = scrape_property( # Generate filename based on current timestamp
site_name=["zillow", "realtor.com", "redfin"], current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
location="85281", filename = f"HomeHarvest_{current_timestamp}.csv"
listing_type="for_rent" # for_sale / sold
properties = scrape_property(
location="San Diego, CA",
listing_type="sold", # or (for_sale, for_rent, pending)
past_days=30, # sold in last 30 days - listed in last 30 days if (for_sale, for_rent)
# property_type=['single_family','multi_family'],
# date_from="2023-05-01", # alternative to past_days
# date_to="2023-05-28",
# foreclosure=True
# mls_only=True, # only fetch MLS listings
) )
print(f"Number of properties: {len(properties)}")
#: Note, to export to CSV or Excel, use properties.to_csv() or properties.to_excel(). # Export to csv
print(properties) properties.to_csv(filename, index=False)
print(properties.head())
``` ```
## Output ## Output
```py ```plaintext
>>> properties.head() >>> properties.head()
property_url site_name listing_type apt_min_price apt_max_price ... MLS MLS # Status Style ... COEDate LotSFApx PrcSqft Stories
0 https://www.redfin.com/AZ/Tempe/1003-W-Washing... redfin for_rent 1666.0 2750.0 ... 0 SDCA 230018348 SOLD CONDOS ... 2023-10-03 290110 803 2
1 https://www.redfin.com/AZ/Tempe/VELA-at-Town-L... redfin for_rent 1665.0 3763.0 ... 1 SDCA 230016614 SOLD TOWNHOMES ... 2023-10-03 None 838 3
2 https://www.redfin.com/AZ/Tempe/Camden-Tempe/a... redfin for_rent 1939.0 3109.0 ... 2 SDCA 230016367 SOLD CONDOS ... 2023-10-03 30056 649 1
3 https://www.redfin.com/AZ/Tempe/Emerson-Park/a... redfin for_rent 1185.0 1817.0 ... 3 MRCA NDP2306335 SOLD SINGLE_FAMILY ... 2023-10-03 7519 661 2
4 https://www.redfin.com/AZ/Tempe/Rio-Paradiso-A... redfin for_rent 1470.0 2235.0 ... 4 SDCA 230014532 SOLD CONDOS ... 2023-10-03 None 752 1
[5 rows x 41 columns] [5 rows x 22 columns]
``` ```
### Parameters for `scrape_properties()` ### Parameters for `scrape_property()`
```plaintext ```
Required Required
├── location (str): address in various formats e.g. just zip, full address, city/state, etc. ├── location (str): The address in various formats - this could be just a zip code, a full address, or city/state, etc.
── listing_type (enum): for_rent, for_sale, sold ── listing_type (option): Choose the type of listing.
- 'for_rent'
- 'for_sale'
- 'sold'
- 'pending' (for pending/contingent sales)
Optional Optional
├── site_name (List[enum], default=all three sites): zillow, realtor.com, redfin ├── property_type (list): Choose the type of properties.
├── proxy (str): in format 'http://user:pass@host:port' or [https, socks] - 'single_family'
- 'multi_family'
- 'condos'
- 'condo_townhome_rowhome_coop'
- 'condo_townhome'
- 'townhomes'
- 'duplex_triplex'
- 'farm'
- 'land'
- 'mobile'
├── return_type (option): Choose the return type.
│ - 'pandas' (default)
│ - 'pydantic'
│ - 'raw' (json)
├── radius (decimal): Radius in miles to find comparable properties based on individual addresses.
│ Example: 5.5 (fetches properties within a 5.5-mile radius if location is set to a specific address; otherwise, ignored)
├── past_days (integer): Number of past days to filter properties. Utilizes 'last_sold_date' for 'sold' listing types, and 'list_date' for others (for_rent, for_sale).
│ Example: 30 (fetches properties listed/sold in the last 30 days)
├── date_from, date_to (string): Start and end dates to filter properties listed or sold, both dates are required.
| (use this to get properties in chunks as there's a 10k result limit)
│ Format for both must be "YYYY-MM-DD".
│ Example: "2023-05-01", "2023-05-15" (fetches properties listed/sold between these dates)
├── mls_only (True/False): If set, fetches only MLS listings (mainly applicable to 'sold' listings)
├── foreclosure (True/False): If set, fetches only foreclosures
├── proxy (string): In format 'http://user:pass@host:port'
├── extra_property_data (True/False): Increases requests by O(n). If set, this fetches additional property data for general searches (e.g. schools, tax appraisals etc.)
├── exclude_pending (True/False): If set, excludes 'pending' properties from the 'for_sale' results unless listing_type is 'pending'
└── limit (integer): Limit the number of properties to fetch. Max & default is 10000.
``` ```
### Property Schema ### Property Schema
```plaintext ```plaintext
Property Property
├── Basic Information: ├── Basic Information:
├── property_url (str) │ ├── property_url
├── site_name (enum): zillow, redfin, realtor.com │ ├── property_id
├── listing_type (enum: ListingType) │ ├── listing_id
└── property_type (enum): house, apartment, condo, townhouse, single_family, multi_family, building ├── mls
│ ├── mls_id
│ └── status
├── Address Details: ├── Address Details:
├── street_address (str) │ ├── street
├── city (str) │ ├── unit
├── state (str) │ ├── city
├── zip_code (str) │ ├── state
├── unit (str) └── zip_code
│ └── country (str)
├── Property Features: ├── Property Description:
├── price (int) │ ├── style
├── tax_assessed_value (int) │ ├── beds
├── currency (str) │ ├── full_baths
├── square_feet (int) │ ├── half_baths
├── beds (int) │ ├── sqft
├── baths (float) │ ├── year_built
├── lot_area_value (float) │ ├── stories
├── lot_area_unit (str) │ ├── garage
├── stories (int) └── lot_sqft
│ └── year_built (int)
├── Miscellaneous Details: ├── Property Listing Details:
├── price_per_sqft (int) │ ├── days_on_mls
├── mls_id (str) │ ├── list_price
├── agent_name (str) │ ├── list_price_min
├── img_src (str) │ ├── list_price_max
├── description (str) │ ├── list_date
├── status_text (str) │ ├── pending_date
├── latitude (float) │ ├── sold_price
├── longitude (float) │ ├── last_sold_date
── posted_time (str) [Only for Zillow] ── price_per_sqft
│ ├── new_construction
│ └── hoa_fee
├── Building Details (for property_type: building): ├── Tax Information:
├── bldg_name (str) │ ├── year
├── bldg_unit_count (int) │ ├── tax
├── bldg_min_beds (int) │ ├── assessment
│ ├── bldg_min_baths (float) │ ├── building
└── bldg_min_area (int) │ ├── land
│ │ └── total
├── Location Details:
│ ├── latitude
│ ├── longitude
│ ├── nearby_schools
├── Agent Info:
│ ├── agent_id
│ ├── agent_name
│ ├── agent_email
│ └── agent_phone
├── Broker Info:
│ ├── broker_id
│ └── broker_name
├── Builder Info:
│ ├── builder_id
│ └── builder_name
├── Office Info:
│ ├── office_id
│ ├── office_name
│ ├── office_phones
│ └── office_email
└── Apartment Details (for property type: apartment):
├── apt_min_beds: int
├── apt_max_beds: int
├── apt_min_baths: float
├── apt_max_baths: float
├── apt_min_price: int
├── apt_max_price: int
├── apt_min_sqft: int
├── apt_max_sqft: int
``` ```
## Supported Countries for Property Scraping
* **Zillow**: contains listings in the **US** & **Canada**
* **Realtor.com**: mainly from the **US** but also has international listings
* **Redfin**: listings mainly in the **US**, **Canada**, & has expanded to some areas in **Mexico**
### Exceptions ### Exceptions
The following exceptions may be raised when using HomeHarvest: The following exceptions may be raised when using HomeHarvest:
- `InvalidSite` - valid options: `zillow`, `redfin`, `realtor.com` - `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold`, `pending`.
- `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold` - `InvalidDate` - date_from or date_to is not in the format YYYY-MM-DD.
- `NoResultsFound` - no properties found from your input - `AuthenticationError` - Realtor.com token request failed.
- `GeoCoordsNotFound` - if Zillow scraper is not able to create geo-coordinates from the location you input
## Frequently Asked Questions
---
**Q: Encountering issues with your queries?**
**A:** Try a single site and/or broaden the location. If problems persist, [submit an issue](https://github.com/ZacharyHampton/HomeHarvest/issues).
---
**Q: Received a Forbidden 403 response code?**
**A:** This indicates that you have been blocked by the real estate site for sending too many requests. Currently, **Zillow** is particularly aggressive with blocking. We recommend:
- Waiting a few seconds between requests.
- Trying a VPN to change your IP address.
---

104
examples/price_of_land.py Normal file
View File

@@ -0,0 +1,104 @@
"""
This script scrapes sold and pending sold land listings in past year for a list of zip codes and saves the data to individual Excel files.
It adds two columns to the data: 'lot_acres' and 'ppa' (price per acre) for user to analyze average price of land in a zip code.
"""
import os
import pandas as pd
from homeharvest import scrape_property
def get_property_details(zip: str, listing_type):
properties = scrape_property(location=zip, listing_type=listing_type, property_type=["land"], past_days=365)
if not properties.empty:
properties["lot_acres"] = properties["lot_sqft"].apply(lambda x: x / 43560 if pd.notnull(x) else None)
properties = properties[properties["sqft"].isnull()]
properties["ppa"] = properties.apply(
lambda row: (
int(
(
row["sold_price"]
if (pd.notnull(row["sold_price"]) and row["status"] == "SOLD")
else row["list_price"]
)
/ row["lot_acres"]
)
if pd.notnull(row["lot_acres"])
and row["lot_acres"] > 0
and (pd.notnull(row["sold_price"]) or pd.notnull(row["list_price"]))
else None
),
axis=1,
)
properties["ppa"] = properties["ppa"].astype("Int64")
selected_columns = [
"property_url",
"property_id",
"style",
"status",
"street",
"city",
"state",
"zip_code",
"county",
"list_date",
"last_sold_date",
"list_price",
"sold_price",
"lot_sqft",
"lot_acres",
"ppa",
]
properties = properties[selected_columns]
return properties
def output_to_excel(zip_code, sold_df, pending_df):
root_folder = os.getcwd()
zip_folder = os.path.join(root_folder, "zips", zip_code)
# Create zip code folder if it doesn't exist
os.makedirs(zip_folder, exist_ok=True)
# Define file paths
sold_file = os.path.join(zip_folder, f"{zip_code}_sold.xlsx")
pending_file = os.path.join(zip_folder, f"{zip_code}_pending.xlsx")
# Save individual sold and pending files
sold_df.to_excel(sold_file, index=False)
pending_df.to_excel(pending_file, index=False)
zip_codes = map(
str,
[
22920,
77024,
78028,
24553,
22967,
22971,
22922,
22958,
22969,
22949,
22938,
24599,
24562,
22976,
24464,
22964,
24581,
],
)
combined_df = pd.DataFrame()
for zip in zip_codes:
sold_df = get_property_details(zip, "sold")
pending_df = get_property_details(zip, "pending")
combined_df = pd.concat([combined_df, sold_df, pending_df], ignore_index=True)
output_to_excel(zip, sold_df, pending_df)
combined_file = os.path.join(os.getcwd(), "zips", "combined.xlsx")
combined_df.to_excel(combined_file, index=False)

View File

@@ -1,188 +1,77 @@
import warnings
import pandas as pd import pandas as pd
from typing import Union
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from .core.scrapers import ScraperInput from .core.scrapers import ScraperInput
from .core.scrapers.redfin import RedfinScraper from .utils import process_result, ordered_properties, validate_input, validate_dates, validate_limit
from .core.scrapers.realtor import RealtorScraper from .core.scrapers.realtor import RealtorScraper
from .core.scrapers.zillow import ZillowScraper from .core.scrapers.models import ListingType, SearchPropertyType, ReturnType, Property
from .core.scrapers.models import ListingType, Property, SiteName from typing import Union, Optional, List
from .exceptions import InvalidSite, InvalidListingType
_scrapers = {
"redfin": RedfinScraper,
"realtor.com": RealtorScraper,
"zillow": ZillowScraper,
}
def _validate_input(site_name: str, listing_type: str) -> None:
if site_name.lower() not in _scrapers:
raise InvalidSite(f"Provided site, '{site_name}', does not exist.")
if listing_type.upper() not in ListingType.__members__:
raise InvalidListingType(
f"Provided listing type, '{listing_type}', does not exist."
)
def _get_ordered_properties(result: Property) -> list[str]:
return [
"property_url",
"site_name",
"listing_type",
"property_type",
"status_text",
"currency",
"price",
"apt_min_price",
"apt_max_price",
"apt_min_sqft",
"apt_max_sqft",
"apt_min_beds",
"apt_max_beds",
"apt_min_baths",
"apt_max_baths",
"tax_assessed_value",
"square_feet",
"price_per_sqft",
"beds",
"baths",
"lot_area_value",
"lot_area_unit",
"street_address",
"unit",
"city",
"state",
"zip_code",
"country",
"posted_time",
"bldg_min_beds",
"bldg_min_baths",
"bldg_min_area",
"bldg_unit_count",
"bldg_name",
"stories",
"year_built",
"agent_name",
"mls_id",
"img_src",
"latitude",
"longitude",
"description",
]
def _process_result(result: Property) -> pd.DataFrame:
prop_data = result.__dict__
prop_data["site_name"] = prop_data["site_name"].value
prop_data["listing_type"] = prop_data["listing_type"].value.lower()
if "property_type" in prop_data and prop_data["property_type"] is not None:
prop_data["property_type"] = prop_data["property_type"].value.lower()
else:
prop_data["property_type"] = None
if "address" in prop_data:
address_data = prop_data["address"]
prop_data["street_address"] = address_data.street_address
prop_data["unit"] = address_data.unit
prop_data["city"] = address_data.city
prop_data["state"] = address_data.state
prop_data["zip_code"] = address_data.zip_code
prop_data["country"] = address_data.country
del prop_data["address"]
properties_df = pd.DataFrame([prop_data])
properties_df = properties_df[_get_ordered_properties(result)]
return properties_df
def _scrape_single_site(
location: str, site_name: str, listing_type: str, proxy: str = None
) -> pd.DataFrame:
"""
Helper function to scrape a single site.
"""
_validate_input(site_name, listing_type)
scraper_input = ScraperInput(
location=location,
listing_type=ListingType[listing_type.upper()],
site_name=SiteName.get_by_value(site_name.lower()),
proxy=proxy,
)
site = _scrapers[site_name.lower()](scraper_input)
results = site.search()
properties_dfs = [_process_result(result) for result in results]
properties_dfs = [
df.dropna(axis=1, how="all") for df in properties_dfs if not df.empty
]
if not properties_dfs:
return pd.DataFrame()
return pd.concat(properties_dfs, ignore_index=True)
def scrape_property( def scrape_property(
location: str, location: str,
site_name: Union[str, list[str]] = None,
listing_type: str = "for_sale", listing_type: str = "for_sale",
return_type: str = "pandas",
property_type: Optional[List[str]] = None,
radius: float = None,
mls_only: bool = False,
past_days: int = None,
proxy: str = None, proxy: str = None,
) -> pd.DataFrame: date_from: str = None, #: TODO: Switch to one parameter, Date, with date_from and date_to, pydantic validation
date_to: str = None,
foreclosure: bool = None,
extra_property_data: bool = True,
exclude_pending: bool = False,
limit: int = 10000
) -> Union[pd.DataFrame, list[dict], list[Property]]:
""" """
Scrape property from various sites from a given location and listing type. Scrape properties from Realtor.com based on a given location and listing type.
:param location: Location to search (e.g. "Dallas, TX", "85281", "2530 Al Lipscomb Way")
:returns: pd.DataFrame :param listing_type: Listing Type (for_sale, for_rent, sold, pending)
:param location: US Location (e.g. 'San Francisco, CA', 'Cook County, IL', '85281', '2530 Al Lipscomb Way') :param return_type: Return type (pandas, pydantic, raw)
:param site_name: Site name or list of site names (e.g. ['realtor.com', 'zillow'], 'redfin') :param property_type: Property Type (single_family, multi_family, condos, condo_townhome_rowhome_coop, condo_townhome, townhomes, duplex_triplex, farm, land, mobile)
:param listing_type: Listing type (e.g. 'for_sale', 'for_rent', 'sold') :param radius: Get properties within _ (e.g. 1.0) miles. Only applicable for individual addresses.
:return: pd.DataFrame containing properties :param mls_only: If set, fetches only listings with MLS IDs.
:param proxy: Proxy to use for scraping
:param past_days: Get properties sold or listed (dependent on your listing_type) in the last _ days.
:param date_from, date_to: Get properties sold or listed (dependent on your listing_type) between these dates. format: 2021-01-28
:param foreclosure: If set, fetches only foreclosure listings.
:param extra_property_data: Increases requests by O(n). If set, this fetches additional property data (e.g. agent, broker, property evaluations etc.)
:param exclude_pending: If true, this excludes pending or contingent properties from the results, unless listing type is pending.
:param limit: Limit the number of results returned. Maximum is 10,000.
""" """
if site_name is None: validate_input(listing_type)
site_name = list(_scrapers.keys()) validate_dates(date_from, date_to)
validate_limit(limit)
if not isinstance(site_name, list): scraper_input = ScraperInput(
site_name = [site_name] location=location,
listing_type=ListingType(listing_type.upper()),
return_type=ReturnType(return_type.lower()),
property_type=[SearchPropertyType[prop.upper()] for prop in property_type] if property_type else None,
proxy=proxy,
radius=radius,
mls_only=mls_only,
last_x_days=past_days,
date_from=date_from,
date_to=date_to,
foreclosure=foreclosure,
extra_property_data=extra_property_data,
exclude_pending=exclude_pending,
limit=limit,
)
results = [] site = RealtorScraper(scraper_input)
results = site.search()
if len(site_name) == 1: if scraper_input.return_type != ReturnType.pandas:
final_df = _scrape_single_site(location, site_name[0], listing_type, proxy) return results
results.append(final_df)
else:
with ThreadPoolExecutor() as executor:
futures = {
executor.submit(
_scrape_single_site, location, s_name, listing_type, proxy
): s_name
for s_name in site_name
}
for future in concurrent.futures.as_completed(futures): properties_dfs = [df for result in results if not (df := process_result(result)).empty]
result = future.result() if not properties_dfs:
results.append(result)
results = [df for df in results if not df.empty and not df.isna().all().all()]
if not results:
return pd.DataFrame() return pd.DataFrame()
final_df = pd.concat(results, ignore_index=True) with warnings.catch_warnings():
warnings.simplefilter("ignore", category=FutureWarning)
columns_to_track = ["street_address", "city", "unit"] return pd.concat(properties_dfs, ignore_index=True, axis=0)[ordered_properties].replace(
{"None": pd.NA, None: pd.NA, "": pd.NA}
#: validate they exist, otherwise create them
for col in columns_to_track:
if col not in final_df.columns:
final_df[col] = None
final_df = final_df.drop_duplicates(
subset=["street_address", "city", "unit"], keep="first"
) )
return final_df

View File

@@ -5,25 +5,14 @@ from homeharvest import scrape_property
def main(): def main():
parser = argparse.ArgumentParser(description="Home Harvest Property Scraper") parser = argparse.ArgumentParser(description="Home Harvest Property Scraper")
parser.add_argument( parser.add_argument("location", type=str, help="Location to scrape (e.g., San Francisco, CA)")
"location", type=str, help="Location to scrape (e.g., San Francisco, CA)"
)
parser.add_argument(
"-s",
"--site_name",
type=str,
nargs="*",
default=None,
help="Site name(s) to scrape from (e.g., realtor, zillow)",
)
parser.add_argument( parser.add_argument(
"-l", "-l",
"--listing_type", "--listing_type",
type=str, type=str,
default="for_sale", default="for_sale",
choices=["for_sale", "for_rent", "sold"], choices=["for_sale", "for_rent", "sold", "pending"],
help="Listing type to scrape", help="Listing type to scrape",
) )
@@ -44,14 +33,38 @@ def main():
help="Name of the output file (without extension)", help="Name of the output file (without extension)",
) )
parser.add_argument("-p", "--proxy", type=str, default=None, help="Proxy to use for scraping")
parser.add_argument( parser.add_argument(
"-p", "--proxy", type=str, default=None, help="Proxy to use for scraping" "-d",
"--days",
type=int,
default=None,
help="Sold/listed in last _ days filter.",
)
parser.add_argument(
"-r",
"--radius",
type=float,
default=None,
help="Get comparable properties within _ (eg. 0.0) miles. Only applicable for individual addresses.",
)
parser.add_argument(
"-m",
"--mls_only",
action="store_true",
help="If set, fetches only MLS listings.",
) )
args = parser.parse_args() args = parser.parse_args()
result = scrape_property( result = scrape_property(
args.location, args.site_name, args.listing_type, proxy=args.proxy args.location,
args.listing_type,
radius=args.radius,
proxy=args.proxy,
mls_only=args.mls_only,
past_days=args.days,
) )
if not args.filename: if not args.filename:

View File

@@ -1,31 +1,127 @@
from dataclasses import dataclass from __future__ import annotations
from typing import Union
import requests import requests
from .models import Property, ListingType, SiteName from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import uuid
from ...exceptions import AuthenticationError
from .models import Property, ListingType, SiteName, SearchPropertyType, ReturnType
import json
from pydantic import BaseModel
@dataclass class ScraperInput(BaseModel):
class ScraperInput:
location: str location: str
listing_type: ListingType listing_type: ListingType
site_name: SiteName property_type: list[SearchPropertyType] | None = None
radius: float | None = None
mls_only: bool | None = False
proxy: str | None = None proxy: str | None = None
last_x_days: int | None = None
date_from: str | None = None
date_to: str | None = None
foreclosure: bool | None = False
extra_property_data: bool | None = True
exclude_pending: bool | None = False
limit: int = 10000
return_type: ReturnType = ReturnType.pandas
class Scraper: class Scraper:
def __init__(self, scraper_input: ScraperInput): session = None
def __init__(
self,
scraper_input: ScraperInput,
):
self.location = scraper_input.location self.location = scraper_input.location
self.listing_type = scraper_input.listing_type self.listing_type = scraper_input.listing_type
self.property_type = scraper_input.property_type
if not self.session:
Scraper.session = requests.Session()
retries = Retry(
total=3, backoff_factor=4, status_forcelist=[429, 403], allowed_methods=frozenset(["GET", "POST"])
)
adapter = HTTPAdapter(max_retries=retries)
Scraper.session.mount("http://", adapter)
Scraper.session.mount("https://", adapter)
Scraper.session.headers.update(
{
"accept": "application/json, text/javascript",
"accept-language": "en-US,en;q=0.9",
"cache-control": "no-cache",
"content-type": "application/json",
"origin": "https://www.realtor.com",
"pragma": "no-cache",
"priority": "u=1, i",
"rdc-ab-tests": "commute_travel_time_variation:v1",
"sec-ch-ua": '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
}
)
if scraper_input.proxy:
proxy_url = scraper_input.proxy
proxies = {"http": proxy_url, "https": proxy_url}
self.session.proxies.update(proxies)
self.session = requests.Session(proxies=scraper_input.proxy)
self.listing_type = scraper_input.listing_type self.listing_type = scraper_input.listing_type
self.site_name = scraper_input.site_name self.radius = scraper_input.radius
self.last_x_days = scraper_input.last_x_days
self.mls_only = scraper_input.mls_only
self.date_from = scraper_input.date_from
self.date_to = scraper_input.date_to
self.foreclosure = scraper_input.foreclosure
self.extra_property_data = scraper_input.extra_property_data
self.exclude_pending = scraper_input.exclude_pending
self.limit = scraper_input.limit
self.return_type = scraper_input.return_type
def search(self) -> list[Property]: def search(self) -> list[Union[Property | dict]]: ...
...
@staticmethod @staticmethod
def _parse_home(home) -> Property: def _parse_home(home) -> Property: ...
...
def handle_location(self): def handle_location(self): ...
...
@staticmethod
def get_access_token():
device_id = str(uuid.uuid4()).upper()
response = requests.post(
"https://graph.realtor.com/auth/token",
headers={
"Host": "graph.realtor.com",
"Accept": "*/*",
"Content-Type": "Application/json",
"X-Client-ID": "rdc_mobile_native,iphone",
"X-Visitor-ID": device_id,
"X-Client-Version": "24.21.23.679885",
"Accept-Language": "en-US,en;q=0.9",
"User-Agent": "Realtor.com/24.21.23.679885 CFNetwork/1494.0.7 Darwin/23.4.0",
},
data=json.dumps(
{
"grant_type": "device_mobile",
"device_id": device_id,
"client_app_id": "rdc_mobile_native,24.21.23.679885,iphone",
}
),
)
data = response.json()
if not (access_token := data.get("access_token")):
raise AuthenticationError(
"Failed to get access token, use a proxy/vpn or wait a moment and try again.", response=response
)
return access_token

View File

@@ -1,5 +1,14 @@
from dataclasses import dataclass from __future__ import annotations
from enum import Enum from enum import Enum
from typing import Optional, Any
from datetime import datetime
from pydantic import BaseModel, computed_field, HttpUrl, Field
class ReturnType(Enum):
pydantic = "pydantic"
pandas = "pandas"
raw = "raw"
class SiteName(Enum): class SiteName(Enum):
@@ -15,98 +24,344 @@ class SiteName(Enum):
raise ValueError(f"{value} not found in {cls}") raise ValueError(f"{value} not found in {cls}")
class SearchPropertyType(Enum):
SINGLE_FAMILY = "single_family"
APARTMENT = "apartment"
CONDOS = "condos"
CONDO_TOWNHOME_ROWHOME_COOP = "condo_townhome_rowhome_coop"
CONDO_TOWNHOME = "condo_townhome"
TOWNHOMES = "townhomes"
DUPLEX_TRIPLEX = "duplex_triplex"
FARM = "farm"
LAND = "land"
MULTI_FAMILY = "multi_family"
MOBILE = "mobile"
class ListingType(Enum): class ListingType(Enum):
FOR_SALE = "FOR_SALE" FOR_SALE = "FOR_SALE"
FOR_RENT = "FOR_RENT" FOR_RENT = "FOR_RENT"
PENDING = "PENDING"
SOLD = "SOLD" SOLD = "SOLD"
class PropertyType(Enum): class PropertyType(Enum):
HOUSE = "HOUSE"
BUILDING = "BUILDING"
CONDO = "CONDO"
TOWNHOUSE = "TOWNHOUSE"
SINGLE_FAMILY = "SINGLE_FAMILY"
MULTI_FAMILY = "MULTI_FAMILY"
MANUFACTURED = "MANUFACTURED"
NEW_CONSTRUCTION = "NEW_CONSTRUCTION"
APARTMENT = "APARTMENT" APARTMENT = "APARTMENT"
APARTMENTS = "APARTMENTS" BUILDING = "BUILDING"
COMMERCIAL = "COMMERCIAL"
GOVERNMENT = "GOVERNMENT"
INDUSTRIAL = "INDUSTRIAL"
CONDO_TOWNHOME = "CONDO_TOWNHOME"
CONDO_TOWNHOME_ROWHOME_COOP = "CONDO_TOWNHOME_ROWHOME_COOP"
CONDO = "CONDO"
CONDOP = "CONDOP"
CONDOS = "CONDOS"
COOP = "COOP"
DUPLEX_TRIPLEX = "DUPLEX_TRIPLEX"
FARM = "FARM"
INVESTMENT = "INVESTMENT"
LAND = "LAND" LAND = "LAND"
LOT = "LOT" MOBILE = "MOBILE"
MULTI_FAMILY = "MULTI_FAMILY"
RENTAL = "RENTAL"
SINGLE_FAMILY = "SINGLE_FAMILY"
TOWNHOMES = "TOWNHOMES"
OTHER = "OTHER" OTHER = "OTHER"
BLANK = "BLANK"
@classmethod class Address(BaseModel):
def from_int_code(cls, code): full_line: str | None = None
mapping = { street: str | None = None
1: cls.HOUSE,
2: cls.CONDO,
3: cls.TOWNHOUSE,
4: cls.MULTI_FAMILY,
5: cls.LAND,
6: cls.OTHER,
8: cls.SINGLE_FAMILY,
13: cls.SINGLE_FAMILY,
}
return mapping.get(code, cls.BLANK)
@dataclass
class Address:
street_address: str
city: str
state: str
zip_code: str
unit: str | None = None unit: str | None = None
country: str | None = None city: str | None = Field(None, description="The name of the city")
state: str | None = Field(None, description="The name of the state")
zip: str | None = Field(None, description="zip code")
# Additional address fields from GraphQL
street_direction: str | None = None
street_number: str | None = None
street_name: str | None = None
street_suffix: str | None = None
@computed_field
@property
def formatted_address(self) -> str | None:
"""Computed property that combines full_line, city, state, and zip into a formatted address."""
parts = []
if self.full_line:
parts.append(self.full_line)
city_state_zip = []
if self.city:
city_state_zip.append(self.city)
if self.state:
city_state_zip.append(self.state)
if self.zip:
city_state_zip.append(self.zip)
if city_state_zip:
parts.append(", ".join(city_state_zip))
return ", ".join(parts) if parts else None
@dataclass
class Property:
property_url: str
site_name: SiteName
listing_type: ListingType
address: Address
property_type: PropertyType | None = None
# house for sale
price: int | None = None class Description(BaseModel):
tax_assessed_value: int | None = None primary_photo: HttpUrl | None = None
currency: str | None = None alt_photos: list[HttpUrl] | None = None
square_feet: int | None = None style: PropertyType | None = None
beds: int | None = None beds: int | None = Field(None, description="Total number of bedrooms")
baths: float | None = None baths_full: int | None = Field(None, description="Total number of full bathrooms (4 parts: Sink, Shower, Bathtub and Toilet)")
lot_area_value: float | None = None baths_half: int | None = Field(None, description="Total number of 1/2 bathrooms (2 parts: Usually Sink and Toilet)")
lot_area_unit: str | None = None sqft: int | None = Field(None, description="Square footage of the Home")
stories: int | None = None lot_sqft: int | None = Field(None, description="Lot square footage")
year_built: int | None = None sold_price: int | None = Field(None, description="Sold price of home")
price_per_sqft: int | None = None year_built: int | None = Field(None, description="The year the building/home was built")
garage: float | None = Field(None, description="Number of garage spaces")
stories: int | None = Field(None, description="Number of stories in the building")
text: str | None = None
# Additional description fields
name: str | None = None
type: str | None = None
class AgentPhone(BaseModel):
number: str | None = None
type: str | None = None
primary: bool | None = None
ext: str | None = None
class Entity(BaseModel):
name: str | None = None # Make name optional since it can be None
uuid: str | None = None
class Agent(Entity):
mls_set: str | None = None
nrds_id: str | None = None
phones: list[dict] | AgentPhone | None = None
email: str | None = None
href: str | None = None
state_license: str | None = Field(None, description="Advertiser agent state license number")
class Office(Entity):
mls_set: str | None = None
email: str | None = None
href: str | None = None
phones: list[dict] | AgentPhone | None = None
class Broker(Entity):
pass
class Builder(Entity):
pass
class Advertisers(BaseModel):
agent: Agent | None = None
broker: Broker | None = None
builder: Builder | None = None
office: Office | None = None
class Property(BaseModel):
property_url: HttpUrl
property_id: str = Field(..., description="Unique Home identifier also known as property id")
#: allows_cats: bool
#: allows_dogs: bool
listing_id: str | None = None
permalink: str | None = None
mls: str | None = None
mls_id: str | None = None mls_id: str | None = None
status: str | None = Field(None, description="Listing status: for_sale, for_rent, sold, off_market, active (New Home Subdivisions), other (if none of the above conditions were met)")
address: Address | None = None
list_price: int | None = Field(None, description="The current price of the Home")
list_price_min: int | None = None
list_price_max: int | None = None
list_date: datetime | None = Field(None, description="The time this Home entered Move system")
pending_date: datetime | None = Field(None, description="The date listing went into pending state")
last_sold_date: datetime | None = Field(None, description="Last time the Home was sold")
prc_sqft: int | None = None
new_construction: bool | None = Field(None, description="Search for new construction homes")
hoa_fee: int | None = Field(None, description="Search for homes where HOA fee is known and falls within specified range")
days_on_mls: int | None = Field(None, description="An integer value determined by the MLS to calculate days on market")
description: Description | None = None
tags: list[str] | None = None
details: list[HomeDetails] | None = None
agent_name: str | None = None
img_src: str | None = None
description: str | None = None
status_text: str | None = None
latitude: float | None = None latitude: float | None = None
longitude: float | None = None longitude: float | None = None
posted_time: str | None = None neighborhoods: Optional[str] = None
county: Optional[str] = Field(None, description="County associated with home")
fips_code: Optional[str] = Field(None, description="The FIPS (Federal Information Processing Standard) code for the county")
nearby_schools: list[str] | None = None
assessed_value: int | None = None
estimated_value: int | None = None
tax: int | None = None
tax_history: list[TaxHistory] | None = None
# building for sale advertisers: Advertisers | None = None
bldg_name: str | None = None
bldg_unit_count: int | None = None
bldg_min_beds: int | None = None
bldg_min_baths: float | None = None
bldg_min_area: int | None = None
# apt # Additional fields from GraphQL that aren't currently parsed
apt_min_beds: int | None = None mls_status: str | None = None
apt_max_beds: int | None = None last_sold_price: int | None = None
apt_min_baths: float | None = None
apt_max_baths: float | None = None # Structured data from GraphQL
apt_min_price: int | None = None open_houses: list[OpenHouse] | None = None
apt_max_price: int | None = None pet_policy: PetPolicy | None = None
apt_min_sqft: int | None = None units: list[Unit] | None = None
apt_max_sqft: int | None = None monthly_fees: HomeMonthlyFee | None = Field(None, description="Monthly fees. Currently only some rental data will have them.")
one_time_fees: list[HomeOneTimeFee] | None = Field(None, description="One time fees. Currently only some rental data will have them.")
parking: HomeParkingDetails | None = Field(None, description="Parking information. Currently only some rental data will have it.")
terms: list[PropertyDetails] | None = None
popularity: Popularity | None = None
tax_record: TaxRecord | None = None
parcel_info: dict | None = None # Keep as dict for flexibility
current_estimates: list[PropertyEstimate] | None = None
estimates: HomeEstimates | None = None
photos: list[dict] | None = None # Keep as dict for photo structure
flags: HomeFlags | None = Field(None, description="Home flags for Listing/Property")
# Specialized models for GraphQL types
class HomeMonthlyFee(BaseModel):
description: str | None = None
display_amount: str | None = None
class HomeOneTimeFee(BaseModel):
description: str | None = None
display_amount: str | None = None
class HomeParkingDetails(BaseModel):
unassigned_space_rent: int | None = None
assigned_spaces_available: int | None = None
description: str | None = Field(None, description="Parking information. Currently only some rental data will have it.")
assigned_space_rent: int | None = None
class PetPolicy(BaseModel):
cats: bool | None = Field(None, description="Search for homes which allow cats")
dogs: bool | None = Field(None, description="Search for homes which allow dogs")
dogs_small: bool | None = Field(None, description="Search for homes with allow small dogs")
dogs_large: bool | None = Field(None, description="Search for homes which allow large dogs")
class OpenHouse(BaseModel):
start_date: datetime | None = None
end_date: datetime | None = None
description: str | None = None
time_zone: str | None = None
dst: bool | None = None
href: HttpUrl | None = None
methods: list[str] | None = None
class HomeFlags(BaseModel):
is_pending: bool | None = None
is_contingent: bool | None = None
is_new_construction: bool | None = None
is_coming_soon: bool | None = None
is_new_listing: bool | None = None
is_price_reduced: bool | None = None
is_foreclosure: bool | None = None
class PopularityPeriod(BaseModel):
clicks_total: int | None = None
views_total: int | None = None
dwell_time_mean: float | None = None
dwell_time_median: float | None = None
leads_total: int | None = None
shares_total: int | None = None
saves_total: int | None = None
last_n_days: int | None = None
class Popularity(BaseModel):
periods: list[PopularityPeriod] | None = None
class Assessment(BaseModel):
building: int | None = None
land: int | None = None
total: int | None = None
class TaxHistory(BaseModel):
assessment: Assessment | None = None
market: Assessment | None = Field(None, description="Market values as provided by the county or local taxing/assessment authority")
appraisal: Assessment | None = Field(None, description="Appraised value given by taxing authority")
value: Assessment | None = Field(None, description="Value closest to current market value used for assessment by county or local taxing authorities")
tax: int | None = None
year: int | None = None
assessed_year: int | None = Field(None, description="Assessment year for which taxes were billed")
class TaxRecord(BaseModel):
cl_id: str | None = None
public_record_id: str | None = None
last_update_date: datetime | None = None
apn: str | None = None
tax_parcel_id: str | None = None
class EstimateSource(BaseModel):
type: str | None = Field(None, description="Type of the avm vendor, list of values: corelogic, collateral, quantarium")
name: str | None = Field(None, description="Name of the avm vendor")
class PropertyEstimate(BaseModel):
estimate: int | None = Field(None, description="Estimated value of a property")
estimate_high: int | None = Field(None, description="Estimated high value of a property")
estimate_low: int | None = Field(None, description="Estimated low value of a property")
date: datetime | None = Field(None, description="Date of estimation")
is_best_home_value: bool | None = None
source: EstimateSource | None = Field(None, description="Source of the latest estimate value")
class HomeEstimates(BaseModel):
current_values: list[PropertyEstimate] | None = Field(None, description="Current valuation and best value for home from multiple AVM vendors")
class PropertyDetails(BaseModel):
category: str | None = None
text: list[str] | None = None
parent_category: str | None = None
class HomeDetails(BaseModel):
category: str | None = None
text: list[str] | None = None
parent_category: str | None = None
class UnitDescription(BaseModel):
baths_consolidated: str | None = None
baths: float | None = None # Changed to float to handle values like 2.5
beds: int | None = None
sqft: int | None = None
class UnitAvailability(BaseModel):
date: datetime | None = None
class Unit(BaseModel):
availability: UnitAvailability | None = None
description: UnitDescription | None = None
photos: list[dict] | None = None # Keep as dict for photo structure
list_price: int | None = None

View File

@@ -1,33 +1,51 @@
"""
homeharvest.realtor.__init__
~~~~~~~~~~~~
This module implements the scraper for realtor.com
"""
from __future__ import annotations
import json import json
from ..models import Property, Address
from .. import Scraper
from typing import Any, Generator
from ....exceptions import NoResultsFound
from ....utils import parse_address_two, parse_unit
from concurrent.futures import ThreadPoolExecutor, as_completed from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime
from json import JSONDecodeError
from typing import Dict, Union
from tenacity import (
retry,
retry_if_exception_type,
wait_exponential,
stop_after_attempt,
)
from .. import Scraper
from ..models import (
Property,
ListingType,
ReturnType
)
from .queries import GENERAL_RESULTS_QUERY, SEARCH_HOMES_DATA, HOMES_DATA, HOME_FRAGMENT
from .processors import (
process_property,
process_extra_property_details,
get_key
)
class RealtorScraper(Scraper): class RealtorScraper(Scraper):
SEARCH_GQL_URL = "https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta"
PROPERTY_URL = "https://www.realtor.com/realestateandhomes-detail/"
PROPERTY_GQL = "https://graph.realtor.com/graphql"
ADDRESS_AUTOCOMPLETE_URL = "https://parser-external.geo.moveaws.com/suggest"
NUM_PROPERTY_WORKERS = 20
DEFAULT_PAGE_SIZE = 200
def __init__(self, scraper_input): def __init__(self, scraper_input):
super().__init__(scraper_input) super().__init__(scraper_input)
self.search_url = "https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta"
def handle_location(self): def handle_location(self):
headers = {
"authority": "parser-external.geo.moveaws.com",
"accept": "*/*",
"accept-language": "en-US,en;q=0.9",
"origin": "https://www.realtor.com",
"referer": "https://www.realtor.com/",
"sec-ch-ua": '"Chromium";v="116", "Not)A;Brand";v="24", "Google Chrome";v="116"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "cross-site",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
}
params = { params = {
"input": self.location, "input": self.location,
"client_id": self.listing_type.value.lower().replace("_", "-"), "client_id": self.listing_type.value.lower().replace("_", "-"),
@@ -36,117 +54,151 @@ class RealtorScraper(Scraper):
} }
response = self.session.get( response = self.session.get(
"https://parser-external.geo.moveaws.com/suggest", self.ADDRESS_AUTOCOMPLETE_URL,
params=params, params=params,
headers=headers,
) )
response_json = response.json() response_json = response.json()
result = response_json["autocomplete"] result = response_json["autocomplete"]
if not result: if not result:
raise NoResultsFound("No results found for location: " + self.location) return None
return result[0] return result[0]
def handle_address(self, property_id: str) -> list[Property]: def get_latest_listing_id(self, property_id: str) -> str | None:
query = """query Property($property_id: ID!) { query = """query Property($property_id: ID!) {
property(id: $property_id) { property(id: $property_id) {
property_id listings {
details { listing_id
date_updated primary
garage
permalink
year_built
stories
}
address {
address_validation_code
city
country
county
line
postal_code
state_code
street_direction
street_name
street_number
street_suffix
street_post_direction
unit_value
unit
unit_descriptor
zip
}
basic {
baths
beds
price
sqft
lot_sqft
type
sold_price
}
public_record {
lot_size
sqft
stories
units
year_built
} }
} }
}""" }
"""
variables = {"property_id": property_id} variables = {"property_id": property_id}
payload = { payload = {
"query": query, "query": query,
"variables": variables, "variables": variables,
} }
response = self.session.post(self.search_url, json=payload) response = self.session.post(self.SEARCH_GQL_URL, json=payload)
response_json = response.json() response_json = response.json()
property_info = response_json["data"]["property"] property_info = response_json["data"]["property"]
street_address, unit = parse_address_two(property_info["address"]["line"]) if property_info["listings"] is None:
return None
return [ primary_listing = next(
Property( (listing for listing in property_info["listings"] if listing["primary"]),
site_name=self.site_name, None,
address=Address(
street_address=street_address,
city=property_info["address"]["city"],
state=property_info["address"]["state_code"],
zip_code=property_info["address"]["postal_code"],
unit=unit,
country="USA",
),
property_url="https://www.realtor.com/realestateandhomes-detail/"
+ property_info["details"]["permalink"],
beds=property_info["basic"]["beds"],
baths=property_info["basic"]["baths"],
stories=property_info["details"]["stories"],
year_built=property_info["details"]["year_built"],
square_feet=property_info["basic"]["sqft"],
price_per_sqft=property_info["basic"]["price"]
// property_info["basic"]["sqft"]
if property_info["basic"]["sqft"] is not None
and property_info["basic"]["price"] is not None
else None,
price=property_info["basic"]["price"],
mls_id=property_id,
listing_type=self.listing_type,
lot_area_value=property_info["public_record"]["lot_size"]
if property_info["public_record"] is not None
else None,
) )
] if primary_listing:
return primary_listing["listing_id"]
else:
return property_info["listings"][0]["listing_id"]
def handle_area( def handle_home(self, property_id: str) -> list[Property]:
self, variables: dict, return_total: bool = False
) -> list[Property] | int:
query = ( query = (
"""query Home_search( """query Home($property_id: ID!) {
home(property_id: $property_id) %s
}"""
% HOMES_DATA
)
variables = {"property_id": property_id}
payload = {
"query": query,
"variables": variables,
}
response = self.session.post(self.SEARCH_GQL_URL, json=payload)
response_json = response.json()
property_info = response_json["data"]["home"]
if self.return_type != ReturnType.raw:
return [process_property(property_info, self.mls_only, self.extra_property_data,
self.exclude_pending, self.listing_type, get_key, process_extra_property_details)]
else:
return [property_info]
def general_search(self, variables: dict, search_type: str) -> Dict[str, Union[int, Union[list[Property], list[dict]]]]:
"""
Handles a location area & returns a list of properties
"""
date_param = ""
if self.listing_type == ListingType.SOLD:
if self.date_from and self.date_to:
date_param = f'sold_date: {{ min: "{self.date_from}", max: "{self.date_to}" }}'
elif self.last_x_days:
date_param = f'sold_date: {{ min: "$today-{self.last_x_days}D" }}'
else:
if self.date_from and self.date_to:
date_param = f'list_date: {{ min: "{self.date_from}", max: "{self.date_to}" }}'
elif self.last_x_days:
date_param = f'list_date: {{ min: "$today-{self.last_x_days}D" }}'
property_type_param = ""
if self.property_type:
property_types = [pt.value for pt in self.property_type]
property_type_param = f"type: {json.dumps(property_types)}"
sort_param = (
"sort: [{ field: sold_date, direction: desc }]"
if self.listing_type == ListingType.SOLD
else "" #: "sort: [{ field: list_date, direction: desc }]" #: prioritize normal fractal sort from realtor
)
pending_or_contingent_param = (
"or_filters: { contingent: true, pending: true }" if self.listing_type == ListingType.PENDING else ""
)
listing_type = ListingType.FOR_SALE if self.listing_type == ListingType.PENDING else self.listing_type
is_foreclosure = ""
if variables.get("foreclosure") is True:
is_foreclosure = "foreclosure: true"
elif variables.get("foreclosure") is False:
is_foreclosure = "foreclosure: false"
if search_type == "comps": #: comps search, came from an address
query = """query Property_search(
$coordinates: [Float]!
$radius: String!
$offset: Int!,
) {
home_search(
query: {
%s
nearby: {
coordinates: $coordinates
radius: $radius
}
status: %s
%s
%s
%s
}
%s
limit: 200
offset: $offset
) %s
}""" % (
is_foreclosure,
listing_type.value.lower(),
date_param,
property_type_param,
pending_or_contingent_param,
sort_param,
GENERAL_RESULTS_QUERY,
)
elif search_type == "area": #: general search, came from a general location
query = """query Home_search(
$city: String, $city: String,
$county: [String], $county: [String],
$state_code: String, $state_code: String,
@@ -155,61 +207,45 @@ class RealtorScraper(Scraper):
) { ) {
home_search( home_search(
query: { query: {
%s
city: $city city: $city
county: $county county: $county
postal_code: $postal_code postal_code: $postal_code
state_code: $state_code state_code: $state_code
status: %s status: %s
%s
%s
%s
} }
bucket: { sort: "fractal_v1.1.3_fr" }
%s
limit: 200 limit: 200
offset: $offset offset: $offset
) %s
}""" % (
is_foreclosure,
listing_type.value.lower(),
date_param,
property_type_param,
pending_or_contingent_param,
sort_param,
GENERAL_RESULTS_QUERY,
)
else: #: general search, came from an address
query = (
"""query Property_search(
$property_id: [ID]!
$offset: Int!,
) { ) {
count home_search(
total query: {
results { property_id: $property_id
property_id
description {
baths
beds
lot_sqft
sqft
text
sold_price
stories
year_built
garage
unit_number
floor_number
}
location {
address {
city
country
line
postal_code
state_code
state
street_direction
street_name
street_number
street_post_direction
street_suffix
unit
coordinate {
lon
lat
}
}
}
list_price
price_per_sqft
source {
id
}
}
} }
limit: 1
offset: $offset
) %s
}""" }"""
% self.listing_type.value.lower() % GENERAL_RESULTS_QUERY
) )
payload = { payload = {
@@ -217,102 +253,168 @@ class RealtorScraper(Scraper):
"variables": variables, "variables": variables,
} }
response = self.session.post(self.search_url, json=payload) response = self.session.post(self.SEARCH_GQL_URL, json=payload)
response.raise_for_status()
response_json = response.json() response_json = response.json()
search_key = "home_search" if "home_search" in query else "property_search"
if return_total: properties: list[Union[Property, dict]] = []
return response_json["data"]["home_search"]["total"]
properties: list[Property] = []
if ( if (
response_json is None response_json is None
or "data" not in response_json or "data" not in response_json
or response_json["data"] is None or response_json["data"] is None
or "home_search" not in response_json["data"] or search_key not in response_json["data"]
or response_json["data"]["home_search"] is None or response_json["data"][search_key] is None
or "results" not in response_json["data"]["home_search"] or "results" not in response_json["data"][search_key]
): ):
return [] return {"total": 0, "properties": []}
for result in response_json["data"]["home_search"]["results"]: properties_list = response_json["data"][search_key]["results"]
street_address, unit = parse_address_two( total_properties = response_json["data"][search_key]["total"]
result["location"]["address"]["line"] offset = variables.get("offset", 0)
)
realty_property = Property(
address=Address(
street_address=street_address,
city=result["location"]["address"]["city"],
state=result["location"]["address"]["state_code"],
zip_code=result["location"]["address"]["postal_code"],
unit=parse_unit(result["location"]["address"]["unit"]),
country="USA",
),
latitude=result["location"]["address"]["coordinate"]["lat"]
if result
and result.get("location")
and result["location"].get("address")
and result["location"]["address"].get("coordinate")
and "lat" in result["location"]["address"]["coordinate"]
else None,
longitude=result["location"]["address"]["coordinate"]["lon"]
if result
and result.get("location")
and result["location"].get("address")
and result["location"]["address"].get("coordinate")
and "lon" in result["location"]["address"]["coordinate"]
else None,
site_name=self.site_name,
property_url="https://www.realtor.com/realestateandhomes-detail/"
+ result["property_id"],
beds=result["description"]["beds"],
baths=result["description"]["baths"],
stories=result["description"]["stories"],
year_built=result["description"]["year_built"],
square_feet=result["description"]["sqft"],
price_per_sqft=result["price_per_sqft"],
price=result["list_price"],
mls_id=result["property_id"],
listing_type=self.listing_type,
lot_area_value=result["description"]["lot_sqft"],
)
properties.append(realty_property) #: limit the number of properties to be processed
#: example, if your offset is 200, and your limit is 250, return 50
properties_list: list[dict] = properties_list[: self.limit - offset]
return properties if self.extra_property_data:
property_ids = [data["property_id"] for data in properties_list]
extra_property_details = self.get_bulk_prop_details(property_ids) or {}
for result in properties_list:
specific_details_for_property = extra_property_details.get(result["property_id"], {})
#: address is retrieved on both homes and search homes, so when merged, homes overrides,
# this gets the internal data we want and only updates that (migrate to a func if more fields)
if "location" in specific_details_for_property:
result["location"].update(specific_details_for_property["location"])
del specific_details_for_property["location"]
result.update(specific_details_for_property)
if self.return_type != ReturnType.raw:
with ThreadPoolExecutor(max_workers=self.NUM_PROPERTY_WORKERS) as executor:
futures = [executor.submit(process_property, result, self.mls_only, self.extra_property_data,
self.exclude_pending, self.listing_type, get_key, process_extra_property_details) for result in properties_list]
for future in as_completed(futures):
result = future.result()
if result:
properties.append(result)
else:
properties = properties_list
return {
"total": total_properties,
"properties": properties,
}
def search(self): def search(self):
location_info = self.handle_location() location_info = self.handle_location()
if not location_info:
return []
location_type = location_info["area_type"] location_type = location_info["area_type"]
if location_type == "address":
property_id = location_info["mpr_id"]
return self.handle_address(property_id)
offset = 0
search_variables = { search_variables = {
"offset": 0,
}
search_type = (
"comps"
if self.radius and location_type == "address"
else "address" if location_type == "address" and not self.radius else "area"
)
if location_type == "address":
if not self.radius: #: single address search, non comps
property_id = location_info["mpr_id"]
return self.handle_home(property_id)
else: #: general search, comps (radius)
if not location_info.get("centroid"):
return []
coordinates = list(location_info["centroid"].values())
search_variables |= {
"coordinates": coordinates,
"radius": "{}mi".format(self.radius),
}
elif location_type == "postal_code":
search_variables |= {
"postal_code": location_info.get("postal_code"),
}
else: #: general search, location
search_variables |= {
"city": location_info.get("city"), "city": location_info.get("city"),
"county": location_info.get("county"), "county": location_info.get("county"),
"state_code": location_info.get("state_code"), "state_code": location_info.get("state_code"),
"postal_code": location_info.get("postal_code"), "postal_code": location_info.get("postal_code"),
"offset": offset,
} }
total = self.handle_area(search_variables, return_total=True) if self.foreclosure:
search_variables["foreclosure"] = self.foreclosure
homes = [] result = self.general_search(search_variables, search_type=search_type)
with ThreadPoolExecutor(max_workers=10) as executor: total = result["total"]
homes = result["properties"]
with ThreadPoolExecutor() as executor:
futures = [ futures = [
executor.submit( executor.submit(
self.handle_area, self.general_search,
variables=search_variables | {"offset": i}, variables=search_variables | {"offset": i},
return_total=False, search_type=search_type,
)
for i in range(
self.DEFAULT_PAGE_SIZE,
min(total, self.limit),
self.DEFAULT_PAGE_SIZE,
) )
for i in range(0, total, 200)
] ]
for future in as_completed(futures): for future in as_completed(futures):
homes.extend(future.result()) homes.extend(future.result()["properties"])
return homes return homes
@retry(
retry=retry_if_exception_type(JSONDecodeError),
wait=wait_exponential(min=4, max=10),
stop=stop_after_attempt(3),
)
def get_bulk_prop_details(self, property_ids: list[str]) -> dict:
"""
Fetch extra property details for multiple properties in a single GraphQL query.
Returns a map of property_id to its details.
"""
if not self.extra_property_data or not property_ids:
return {}
property_ids = list(set(property_ids))
# Construct the bulk query
fragments = "\n".join(
f'home_{property_id}: home(property_id: {property_id}) {{ ...HomeData }}'
for property_id in property_ids
)
query = f"""{HOME_FRAGMENT}
query GetHomes {{
{fragments}
}}"""
response = self.session.post(self.SEARCH_GQL_URL, json={"query": query})
data = response.json()
if "data" not in data:
return {}
properties = data["data"]
return {data.replace('home_', ''): properties[data] for data in properties if properties[data]}

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,279 @@
"""
Parsers for realtor.com data processing
"""
from datetime import datetime
from typing import Optional
from ..models import Address, Description, PropertyType
def parse_open_houses(open_houses_data: list[dict] | None) -> list[dict] | None:
"""Parse open houses data and convert date strings to datetime objects"""
if not open_houses_data:
return None
parsed_open_houses = []
for oh in open_houses_data:
parsed_oh = oh.copy()
# Parse start_date and end_date
if parsed_oh.get("start_date"):
try:
parsed_oh["start_date"] = datetime.fromisoformat(parsed_oh["start_date"].replace("Z", "+00:00"))
except (ValueError, AttributeError):
parsed_oh["start_date"] = None
if parsed_oh.get("end_date"):
try:
parsed_oh["end_date"] = datetime.fromisoformat(parsed_oh["end_date"].replace("Z", "+00:00"))
except (ValueError, AttributeError):
parsed_oh["end_date"] = None
parsed_open_houses.append(parsed_oh)
return parsed_open_houses
def parse_units(units_data: list[dict] | None) -> list[dict] | None:
"""Parse units data and convert date strings to datetime objects"""
if not units_data:
return None
parsed_units = []
for unit in units_data:
parsed_unit = unit.copy()
# Parse availability date
if parsed_unit.get("availability") and parsed_unit["availability"].get("date"):
try:
parsed_unit["availability"]["date"] = datetime.fromisoformat(parsed_unit["availability"]["date"].replace("Z", "+00:00"))
except (ValueError, AttributeError):
parsed_unit["availability"]["date"] = None
parsed_units.append(parsed_unit)
return parsed_units
def parse_tax_record(tax_record_data: dict | None) -> dict | None:
"""Parse tax record data and convert date strings to datetime objects"""
if not tax_record_data:
return None
parsed_tax_record = tax_record_data.copy()
# Parse last_update_date
if parsed_tax_record.get("last_update_date"):
try:
parsed_tax_record["last_update_date"] = datetime.fromisoformat(parsed_tax_record["last_update_date"].replace("Z", "+00:00"))
except (ValueError, AttributeError):
parsed_tax_record["last_update_date"] = None
return parsed_tax_record
def parse_current_estimates(estimates_data: list[dict] | None) -> list[dict] | None:
"""Parse current estimates data and convert date strings to datetime objects"""
if not estimates_data:
return None
parsed_estimates = []
for estimate in estimates_data:
parsed_estimate = estimate.copy()
# Parse date
if parsed_estimate.get("date"):
try:
parsed_estimate["date"] = datetime.fromisoformat(parsed_estimate["date"].replace("Z", "+00:00"))
except (ValueError, AttributeError):
parsed_estimate["date"] = None
# Parse source information
if parsed_estimate.get("source"):
source_data = parsed_estimate["source"]
parsed_estimate["source"] = {
"type": source_data.get("type"),
"name": source_data.get("name")
}
parsed_estimates.append(parsed_estimate)
return parsed_estimates
def parse_estimates(estimates_data: dict | None) -> dict | None:
"""Parse estimates data and convert date strings to datetime objects"""
if not estimates_data:
return None
parsed_estimates = estimates_data.copy()
# Parse current_values (which is aliased as currentValues in GraphQL)
current_values = parsed_estimates.get("currentValues") or parsed_estimates.get("current_values")
if current_values:
parsed_current_values = []
for estimate in current_values:
parsed_estimate = estimate.copy()
# Parse date
if parsed_estimate.get("date"):
try:
parsed_estimate["date"] = datetime.fromisoformat(parsed_estimate["date"].replace("Z", "+00:00"))
except (ValueError, AttributeError):
parsed_estimate["date"] = None
# Parse source information
if parsed_estimate.get("source"):
source_data = parsed_estimate["source"]
parsed_estimate["source"] = {
"type": source_data.get("type"),
"name": source_data.get("name")
}
# Convert GraphQL aliases to Pydantic field names
if "estimateHigh" in parsed_estimate:
parsed_estimate["estimate_high"] = parsed_estimate.pop("estimateHigh")
if "estimateLow" in parsed_estimate:
parsed_estimate["estimate_low"] = parsed_estimate.pop("estimateLow")
if "isBestHomeValue" in parsed_estimate:
parsed_estimate["is_best_home_value"] = parsed_estimate.pop("isBestHomeValue")
parsed_current_values.append(parsed_estimate)
parsed_estimates["current_values"] = parsed_current_values
# Remove the GraphQL alias if it exists
if "currentValues" in parsed_estimates:
del parsed_estimates["currentValues"]
return parsed_estimates
def parse_neighborhoods(result: dict) -> Optional[str]:
"""Parse neighborhoods from location data"""
neighborhoods_list = []
neighborhoods = result["location"].get("neighborhoods", [])
if neighborhoods:
for neighborhood in neighborhoods:
name = neighborhood.get("name")
if name:
neighborhoods_list.append(name)
return ", ".join(neighborhoods_list) if neighborhoods_list else None
def handle_none_safely(address_part):
"""Handle None values safely for address parts"""
if address_part is None:
return ""
return address_part
def parse_address(result: dict, search_type: str) -> Address:
"""Parse address data from result"""
if search_type == "general_search":
address = result["location"]["address"]
else:
address = result["address"]
return Address(
full_line=address.get("line"),
street=" ".join(
part
for part in [
address.get("street_number"),
address.get("street_direction"),
address.get("street_name"),
address.get("street_suffix"),
]
if part is not None
).strip(),
unit=address["unit"],
city=address["city"],
state=address["state_code"],
zip=address["postal_code"],
# Additional address fields
street_direction=address.get("street_direction"),
street_number=address.get("street_number"),
street_name=address.get("street_name"),
street_suffix=address.get("street_suffix"),
)
def parse_description(result: dict) -> Description | None:
"""Parse description data from result"""
if not result:
return None
description_data = result.get("description", {})
if description_data is None or not isinstance(description_data, dict):
description_data = {}
style = description_data.get("type", "")
if style is not None:
style = style.upper()
primary_photo = None
if (primary_photo_info := result.get("primary_photo")) and (
primary_photo_href := primary_photo_info.get("href")
):
primary_photo = primary_photo_href.replace("s.jpg", "od-w480_h360_x2.webp?w=1080&q=75")
return Description(
primary_photo=primary_photo,
alt_photos=process_alt_photos(result.get("photos", [])),
style=(PropertyType.__getitem__(style) if style and style in PropertyType.__members__ else None),
beds=description_data.get("beds"),
baths_full=description_data.get("baths_full"),
baths_half=description_data.get("baths_half"),
sqft=description_data.get("sqft"),
lot_sqft=description_data.get("lot_sqft"),
sold_price=(
result.get("last_sold_price") or description_data.get("sold_price")
if result.get("last_sold_date") or result["list_price"] != description_data.get("sold_price")
else None
), #: has a sold date or list and sold price are different
year_built=description_data.get("year_built"),
garage=description_data.get("garage"),
stories=description_data.get("stories"),
text=description_data.get("text"),
# Additional description fields
name=description_data.get("name"),
type=description_data.get("type"),
)
def calculate_days_on_mls(result: dict) -> Optional[int]:
"""Calculate days on MLS from result data"""
list_date_str = result.get("list_date")
list_date = datetime.strptime(list_date_str.split("T")[0], "%Y-%m-%d") if list_date_str else None
last_sold_date_str = result.get("last_sold_date")
last_sold_date = datetime.strptime(last_sold_date_str, "%Y-%m-%d") if last_sold_date_str else None
today = datetime.now()
if list_date:
if result["status"] == "sold":
if last_sold_date:
days = (last_sold_date - list_date).days
if days >= 0:
return days
elif result["status"] in ("for_sale", "for_rent"):
days = (today - list_date).days
if days >= 0:
return days
def process_alt_photos(photos_info: list[dict]) -> list[str] | None:
"""Process alternative photos from photos info"""
if not photos_info:
return None
return [
photo_info["href"].replace("s.jpg", "od-w480_h360_x2.webp?w=1080&q=75")
for photo_info in photos_info
if photo_info.get("href")
]

View File

@@ -0,0 +1,224 @@
"""
Processors for realtor.com property data processing
"""
from datetime import datetime
from typing import Optional
from ..models import (
Property,
ListingType,
Agent,
Broker,
Builder,
Advertisers,
Office,
ReturnType
)
from .parsers import (
parse_open_houses,
parse_units,
parse_tax_record,
parse_current_estimates,
parse_estimates,
parse_neighborhoods,
parse_address,
parse_description,
calculate_days_on_mls,
process_alt_photos
)
def process_advertisers(advertisers: list[dict] | None) -> Advertisers | None:
"""Process advertisers data from GraphQL response"""
if not advertisers:
return None
def _parse_fulfillment_id(fulfillment_id: str | None) -> str | None:
return fulfillment_id if fulfillment_id and fulfillment_id != "0" else None
processed_advertisers = Advertisers()
for advertiser in advertisers:
advertiser_type = advertiser.get("type")
if advertiser_type == "seller": #: agent
processed_advertisers.agent = Agent(
uuid=_parse_fulfillment_id(advertiser.get("fulfillment_id")),
nrds_id=advertiser.get("nrds_id"),
mls_set=advertiser.get("mls_set"),
name=advertiser.get("name"),
email=advertiser.get("email"),
phones=advertiser.get("phones"),
state_license=advertiser.get("state_license"),
)
if advertiser.get("broker") and advertiser["broker"].get("name"): #: has a broker
processed_advertisers.broker = Broker(
uuid=_parse_fulfillment_id(advertiser["broker"].get("fulfillment_id")),
name=advertiser["broker"].get("name"),
)
if advertiser.get("office"): #: has an office
processed_advertisers.office = Office(
uuid=_parse_fulfillment_id(advertiser["office"].get("fulfillment_id")),
mls_set=advertiser["office"].get("mls_set"),
name=advertiser["office"].get("name"),
email=advertiser["office"].get("email"),
phones=advertiser["office"].get("phones"),
)
if advertiser_type == "community": #: could be builder
if advertiser.get("builder"):
processed_advertisers.builder = Builder(
uuid=_parse_fulfillment_id(advertiser["builder"].get("fulfillment_id")),
name=advertiser["builder"].get("name"),
)
return processed_advertisers
def process_property(result: dict, mls_only: bool = False, extra_property_data: bool = False,
exclude_pending: bool = False, listing_type: ListingType = ListingType.FOR_SALE,
get_key_func=None, process_extra_property_details_func=None) -> Property | None:
"""Process property data from GraphQL response"""
mls = result["source"].get("id") if "source" in result and isinstance(result["source"], dict) else None
if not mls and mls_only:
return None
able_to_get_lat_long = (
result
and result.get("location")
and result["location"].get("address")
and result["location"]["address"].get("coordinate")
)
is_pending = result["flags"].get("is_pending")
is_contingent = result["flags"].get("is_contingent")
if (is_pending or is_contingent) and (exclude_pending and listing_type != ListingType.PENDING):
return None
property_id = result["property_id"]
prop_details = process_extra_property_details_func(result) if extra_property_data and process_extra_property_details_func else {}
property_estimates_root = result.get("current_estimates") or result.get("estimates", {}).get("currentValues")
estimated_value = get_key_func(property_estimates_root, [0, "estimate"]) if get_key_func else None
advertisers = process_advertisers(result.get("advertisers"))
realty_property = Property(
mls=mls,
mls_id=(
result["source"].get("listing_id")
if "source" in result and isinstance(result["source"], dict)
else None
),
property_url=result["href"],
property_id=property_id,
listing_id=result.get("listing_id"),
permalink=result.get("permalink"),
status=("PENDING" if is_pending else "CONTINGENT" if is_contingent else result["status"].upper()),
list_price=result["list_price"],
list_price_min=result["list_price_min"],
list_price_max=result["list_price_max"],
list_date=(datetime.fromisoformat(result["list_date"].split("T")[0]) if result.get("list_date") else None),
prc_sqft=result.get("price_per_sqft"),
last_sold_date=(datetime.fromisoformat(result["last_sold_date"]) if result.get("last_sold_date") else None),
pending_date=(datetime.fromisoformat(result["pending_date"].split("T")[0]) if result.get("pending_date") else None),
new_construction=result["flags"].get("is_new_construction") is True,
hoa_fee=(result["hoa"]["fee"] if result.get("hoa") and isinstance(result["hoa"], dict) else None),
latitude=(result["location"]["address"]["coordinate"].get("lat") if able_to_get_lat_long else None),
longitude=(result["location"]["address"]["coordinate"].get("lon") if able_to_get_lat_long else None),
address=parse_address(result, search_type="general_search"),
description=parse_description(result),
neighborhoods=parse_neighborhoods(result),
county=(result["location"]["county"].get("name") if result["location"]["county"] else None),
fips_code=(result["location"]["county"].get("fips_code") if result["location"]["county"] else None),
days_on_mls=calculate_days_on_mls(result),
nearby_schools=prop_details.get("schools"),
assessed_value=prop_details.get("assessed_value"),
estimated_value=estimated_value if estimated_value else None,
advertisers=advertisers,
tax=prop_details.get("tax"),
tax_history=prop_details.get("tax_history"),
# Additional fields from GraphQL
mls_status=result.get("mls_status"),
last_sold_price=result.get("last_sold_price"),
tags=result.get("tags"),
details=result.get("details"),
open_houses=parse_open_houses(result.get("open_houses")),
pet_policy=result.get("pet_policy"),
units=parse_units(result.get("units")),
monthly_fees=result.get("monthly_fees"),
one_time_fees=result.get("one_time_fees"),
parking=result.get("parking"),
terms=result.get("terms"),
popularity=result.get("popularity"),
tax_record=parse_tax_record(result.get("tax_record")),
parcel_info=result.get("location", {}).get("parcel"),
current_estimates=parse_current_estimates(result.get("current_estimates")),
estimates=parse_estimates(result.get("estimates")),
photos=result.get("photos"),
flags=result.get("flags"),
)
return realty_property
def process_extra_property_details(result: dict, get_key_func=None) -> dict:
"""Process extra property details from GraphQL response"""
if get_key_func:
schools = get_key_func(result, ["nearbySchools", "schools"])
assessed_value = get_key_func(result, ["taxHistory", 0, "assessment", "total"])
tax_history = get_key_func(result, ["taxHistory"])
else:
nearby_schools = result.get("nearbySchools")
schools = nearby_schools.get("schools", []) if nearby_schools else []
tax_history_data = result.get("taxHistory", [])
assessed_value = tax_history_data[0]["assessment"]["total"] if tax_history_data and tax_history_data[0].get("assessment", {}).get("total") else None
tax_history = tax_history_data
if schools:
schools = [school["district"]["name"] for school in schools if school["district"].get("name")]
# Process tax history
latest_tax = None
processed_tax_history = None
if tax_history and isinstance(tax_history, list):
tax_history = sorted(tax_history, key=lambda x: x.get("year", 0), reverse=True)
if tax_history and "tax" in tax_history[0]:
latest_tax = tax_history[0]["tax"]
processed_tax_history = []
for entry in tax_history:
if "year" in entry and "tax" in entry:
processed_entry = {
"year": entry["year"],
"tax": entry["tax"],
}
if "assessment" in entry and isinstance(entry["assessment"], dict):
processed_entry["assessment"] = {
"building": entry["assessment"].get("building"),
"land": entry["assessment"].get("land"),
"total": entry["assessment"].get("total"),
}
processed_tax_history.append(processed_entry)
return {
"schools": schools if schools else None,
"assessed_value": assessed_value if assessed_value else None,
"tax": latest_tax,
"tax_history": processed_tax_history,
}
def get_key(data: dict, keys: list):
"""Get nested key from dictionary safely"""
try:
value = data
for key in keys:
value = value[key]
return value or {}
except (KeyError, TypeError, IndexError):
return {}

View File

@@ -0,0 +1,300 @@
_SEARCH_HOMES_DATA_BASE = """{
pending_date
listing_id
property_id
href
permalink
list_date
status
mls_status
last_sold_price
last_sold_date
list_price
list_price_max
list_price_min
price_per_sqft
tags
open_houses {
start_date
end_date
description
time_zone
dst
href
methods
}
details {
category
text
parent_category
}
pet_policy {
cats
dogs
dogs_small
dogs_large
__typename
}
units {
availability {
date
__typename
}
description {
baths_consolidated
baths
beds
sqft
__typename
}
photos(https: true) {
title
href
tags {
label
}
}
list_price
__typename
}
flags {
is_contingent
is_pending
is_new_construction
}
description {
type
sqft
beds
baths_full
baths_half
lot_sqft
year_built
garage
type
name
stories
text
}
source {
id
listing_id
}
hoa {
fee
}
location {
address {
street_direction
street_number
street_name
street_suffix
line
unit
city
state_code
postal_code
coordinate {
lon
lat
}
}
county {
name
fips_code
}
neighborhoods {
name
}
}
tax_record {
cl_id
public_record_id
last_update_date
apn
tax_parcel_id
}
primary_photo(https: true) {
href
}
photos(https: true) {
title
href
tags {
label
}
}
advertisers {
email
broker {
name
fulfillment_id
}
type
name
fulfillment_id
builder {
name
fulfillment_id
}
phones {
ext
primary
type
number
}
office {
name
email
fulfillment_id
href
phones {
number
type
primary
ext
}
mls_set
}
corporation {
specialties
name
bio
href
fulfillment_id
}
mls_set
nrds_id
state_license
rental_corporation {
fulfillment_id
}
rental_management {
name
href
fulfillment_id
}
}
"""
HOME_FRAGMENT = """
fragment HomeData on Home {
property_id
nearbySchools: nearby_schools(radius: 5.0, limit_per_level: 3) {
__typename schools { district { __typename id name } }
}
popularity {
periods {
clicks_total
views_total
dwell_time_mean
dwell_time_median
leads_total
shares_total
saves_total
last_n_days
}
}
location {
parcel {
parcel_id
}
}
taxHistory: tax_history { __typename tax year assessment { __typename building land total } }
monthly_fees {
description
display_amount
}
one_time_fees {
description
display_amount
}
parking {
unassigned_space_rent
assigned_spaces_available
description
assigned_space_rent
}
terms {
text
category
}
}
"""
HOMES_DATA = """%s
nearbySchools: nearby_schools(radius: 5.0, limit_per_level: 3) {
__typename schools { district { __typename id name } }
}
monthly_fees {
description
display_amount
}
one_time_fees {
description
display_amount
}
popularity {
periods {
clicks_total
views_total
dwell_time_mean
dwell_time_median
leads_total
shares_total
saves_total
last_n_days
}
}
location {
parcel {
parcel_id
}
}
parking {
unassigned_space_rent
assigned_spaces_available
description
assigned_space_rent
}
terms {
text
category
}
taxHistory: tax_history { __typename tax year assessment { __typename building land total } }
estimates {
__typename
currentValues: current_values {
__typename
source { __typename type name }
estimate
estimateHigh: estimate_high
estimateLow: estimate_low
date
isBestHomeValue: isbest_homevalue
}
}
}""" % _SEARCH_HOMES_DATA_BASE
SEARCH_HOMES_DATA = """%s
current_estimates {
__typename
source {
__typename
type
name
}
estimate
estimateHigh: estimate_high
estimateLow: estimate_low
date
isBestHomeValue: isbest_homevalue
}
}""" % _SEARCH_HOMES_DATA_BASE
GENERAL_RESULTS_QUERY = """{
count
total
results %s
}""" % SEARCH_HOMES_DATA

View File

@@ -1,244 +0,0 @@
import json
from typing import Any
from .. import Scraper
from ....utils import parse_address_two, parse_unit
from ..models import Property, Address, PropertyType, ListingType, SiteName
from ....exceptions import NoResultsFound
class RedfinScraper(Scraper):
def __init__(self, scraper_input):
super().__init__(scraper_input)
self.listing_type = scraper_input.listing_type
def _handle_location(self):
url = "https://www.redfin.com/stingray/do/location-autocomplete?v=2&al=1&location={}".format(
self.location
)
response = self.session.get(url)
response_json = json.loads(response.text.replace("{}&&", ""))
def get_region_type(match_type: str):
if match_type == "4":
return "2" #: zip
elif match_type == "2":
return "6" #: city
elif match_type == "1":
return "address" #: address, needs to be handled differently
if "exactMatch" not in response_json["payload"]:
raise NoResultsFound(
"No results found for location: {}".format(self.location)
)
if response_json["payload"]["exactMatch"] is not None:
target = response_json["payload"]["exactMatch"]
else:
target = response_json["payload"]["sections"][0]["rows"][0]
return target["id"].split("_")[1], get_region_type(target["type"])
def _parse_home(self, home: dict, single_search: bool = False) -> Property:
def get_value(key: str) -> Any | None:
if key in home and "value" in home[key]:
return home[key]["value"]
if not single_search:
street_address, unit = parse_address_two(get_value("streetLine"))
unit = parse_unit(get_value("streetLine"))
address = Address(
street_address=street_address,
city=home["city"],
state=home["state"],
zip_code=home["zip"],
unit=unit,
country="USA",
)
else:
address_info = home["streetAddress"]
street_address, unit = parse_address_two(address_info["assembledAddress"])
address = Address(
street_address=street_address,
city=home["city"],
state=home["state"],
zip_code=home["zip"],
unit=unit,
country="USA",
)
url = "https://www.redfin.com{}".format(home["url"])
#: property_type = home["propertyType"] if "propertyType" in home else None
lot_size_data = home.get("lotSize")
if not isinstance(lot_size_data, int):
lot_size = (
lot_size_data.get("value", None)
if isinstance(lot_size_data, dict)
else None
)
else:
lot_size = lot_size_data
return Property(
site_name=self.site_name,
listing_type=self.listing_type,
address=address,
property_url=url,
beds=home["beds"] if "beds" in home else None,
baths=home["baths"] if "baths" in home else None,
stories=home["stories"] if "stories" in home else None,
agent_name=get_value("listingAgent"),
description=home["listingRemarks"] if "listingRemarks" in home else None,
year_built=get_value("yearBuilt")
if not single_search
else home["yearBuilt"],
square_feet=get_value("sqFt"),
lot_area_value=lot_size,
property_type=PropertyType.from_int_code(home.get("propertyType")),
price_per_sqft=get_value("pricePerSqFt"),
price=get_value("price"),
mls_id=get_value("mlsId"),
latitude=home["latLong"]["latitude"]
if "latLong" in home and "latitude" in home["latLong"]
else None,
longitude=home["latLong"]["longitude"]
if "latLong" in home and "longitude" in home["latLong"]
else None,
)
def _handle_rentals(self, region_id, region_type):
url = f"https://www.redfin.com/stingray/api/v1/search/rentals?al=1&isRentals=true&region_id={region_id}&region_type={region_type}&num_homes=100000"
response = self.session.get(url)
response.raise_for_status()
homes = response.json()
properties_list = []
for home in homes["homes"]:
home_data = home["homeData"]
rental_data = home["rentalExtension"]
property_url = f"https://www.redfin.com{home_data.get('url', '')}"
address_info = home_data.get("addressInfo", {})
centroid = address_info.get("centroid", {}).get("centroid", {})
address = Address(
street_address=address_info.get("formattedStreetLine", None),
city=address_info.get("city", None),
state=address_info.get("state", None),
zip_code=address_info.get("zip", None),
unit=None,
country="US" if address_info.get("countryCode", None) == 1 else None,
)
price_range = rental_data.get("rentPriceRange", {"min": None, "max": None})
bed_range = rental_data.get("bedRange", {"min": None, "max": None})
bath_range = rental_data.get("bathRange", {"min": None, "max": None})
sqft_range = rental_data.get("sqftRange", {"min": None, "max": None})
property_ = Property(
property_url=property_url,
site_name=SiteName.REDFIN,
listing_type=ListingType.FOR_RENT,
address=address,
apt_min_beds=bed_range.get("min", None),
apt_min_baths=bath_range.get("min", None),
apt_max_beds=bed_range.get("max", None),
apt_max_baths=bath_range.get("max", None),
description=rental_data.get("description", None),
latitude=centroid.get("latitude", None),
longitude=centroid.get("longitude", None),
apt_min_price=price_range.get("min", None),
apt_max_price=price_range.get("max", None),
apt_min_sqft=sqft_range.get("min", None),
apt_max_sqft=sqft_range.get("max", None),
img_src=home_data.get("staticMapUrl", None),
posted_time=rental_data.get("lastUpdated", None),
bldg_name=rental_data.get("propertyName", None),
)
properties_list.append(property_)
if not properties_list:
raise NoResultsFound("No rentals found for the given location.")
return properties_list
def _parse_building(self, building: dict) -> Property:
street_address = " ".join(
[
building["address"]["streetNumber"],
building["address"]["directionalPrefix"],
building["address"]["streetName"],
building["address"]["streetType"],
]
)
street_address, unit = parse_address_two(street_address)
return Property(
site_name=self.site_name,
property_type=PropertyType("BUILDING"),
address=Address(
street_address=street_address,
city=building["address"]["city"],
state=building["address"]["stateOrProvinceCode"],
zip_code=building["address"]["postalCode"],
unit=parse_unit(
" ".join(
[
building["address"]["unitType"],
building["address"]["unitValue"],
]
)
),
),
property_url="https://www.redfin.com{}".format(building["url"]),
listing_type=self.listing_type,
bldg_unit_count=building["numUnitsForSale"],
)
def handle_address(self, home_id: str):
"""
EPs:
https://www.redfin.com/stingray/api/home/details/initialInfo?al=1&path=/TX/Austin/70-Rainey-St-78701/unit-1608/home/147337694
https://www.redfin.com/stingray/api/home/details/mainHouseInfoPanelInfo?propertyId=147337694&accessLevel=3
https://www.redfin.com/stingray/api/home/details/aboveTheFold?propertyId=147337694&accessLevel=3
https://www.redfin.com/stingray/api/home/details/belowTheFold?propertyId=147337694&accessLevel=3
"""
url = "https://www.redfin.com/stingray/api/home/details/aboveTheFold?propertyId={}&accessLevel=3".format(
home_id
)
response = self.session.get(url)
response_json = json.loads(response.text.replace("{}&&", ""))
parsed_home = self._parse_home(
response_json["payload"]["addressSectionInfo"], single_search=True
)
return [parsed_home]
def search(self):
region_id, region_type = self._handle_location()
if region_type == "address":
home_id = region_id
return self.handle_address(home_id)
if self.listing_type == ListingType.FOR_RENT:
return self._handle_rentals(region_id, region_type)
else:
if self.listing_type == ListingType.FOR_SALE:
url = f"https://www.redfin.com/stingray/api/gis?al=1&region_id={region_id}&region_type={region_type}&num_homes=100000"
else:
url = f"https://www.redfin.com/stingray/api/gis?al=1&region_id={region_id}&region_type={region_type}&sold_within_days=30&num_homes=100000"
response = self.session.get(url)
response_json = json.loads(response.text.replace("{}&&", ""))
homes = [
self._parse_home(home) for home in response_json["payload"]["homes"]
] + [
self._parse_building(building)
for building in response_json["payload"]["buildings"].values()
]
return homes

View File

@@ -1,333 +0,0 @@
import re
import json
from .. import Scraper
from ....utils import parse_address_two, parse_unit
from ....exceptions import GeoCoordsNotFound, NoResultsFound
from ..models import Property, Address, ListingType, PropertyType
class ZillowScraper(Scraper):
def __init__(self, scraper_input):
super().__init__(scraper_input)
if not self.is_plausible_location(self.location):
raise NoResultsFound("Invalid location input: {}".format(self.location))
if self.listing_type == ListingType.FOR_SALE:
self.url = f"https://www.zillow.com/homes/for_sale/{self.location}_rb/"
elif self.listing_type == ListingType.FOR_RENT:
self.url = f"https://www.zillow.com/homes/for_rent/{self.location}_rb/"
else:
self.url = f"https://www.zillow.com/homes/recently_sold/{self.location}_rb/"
def is_plausible_location(self, location: str) -> bool:
url = (
"https://www.zillowstatic.com/autocomplete/v3/suggestions?q={"
"}&abKey=6666272a-4b99-474c-b857-110ec438732b&clientId=homepage-render"
).format(location)
response = self.session.get(url)
return response.json()["results"] != []
def search(self):
resp = self.session.get(
self.url, headers=self._get_headers()
)
resp.raise_for_status()
content = resp.text
match = re.search(
r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
content,
re.DOTALL,
)
if not match:
raise NoResultsFound(
"No results were found for Zillow with the given Location."
)
json_str = match.group(1)
data = json.loads(json_str)
if "searchPageState" in data["props"]["pageProps"]:
pattern = r'window\.mapBounds = \{\s*"west":\s*(-?\d+\.\d+),\s*"east":\s*(-?\d+\.\d+),\s*"south":\s*(-?\d+\.\d+),\s*"north":\s*(-?\d+\.\d+)\s*\};'
match = re.search(pattern, content)
if match:
coords = [float(coord) for coord in match.groups()]
return self._fetch_properties_backend(coords)
else:
raise GeoCoordsNotFound("Box bounds could not be located.")
elif "gdpClientCache" in data["props"]["pageProps"]:
gdp_client_cache = json.loads(data["props"]["pageProps"]["gdpClientCache"])
main_key = list(gdp_client_cache.keys())[0]
property_data = gdp_client_cache[main_key]["property"]
property = self._get_single_property_page(property_data)
return [property]
raise NoResultsFound("Specific property data not found in the response.")
def _fetch_properties_backend(self, coords):
url = "https://www.zillow.com/async-create-search-page-state"
filter_state_for_sale = {
"sortSelection": {
# "value": "globalrelevanceex"
"value": "days"
},
"isAllHomes": {"value": True},
}
filter_state_for_rent = {
"isForRent": {"value": True},
"isForSaleByAgent": {"value": False},
"isForSaleByOwner": {"value": False},
"isNewConstruction": {"value": False},
"isComingSoon": {"value": False},
"isAuction": {"value": False},
"isForSaleForeclosure": {"value": False},
"isAllHomes": {"value": True},
}
filter_state_sold = {
"isRecentlySold": {"value": True},
"isForSaleByAgent": {"value": False},
"isForSaleByOwner": {"value": False},
"isNewConstruction": {"value": False},
"isComingSoon": {"value": False},
"isAuction": {"value": False},
"isForSaleForeclosure": {"value": False},
"isAllHomes": {"value": True},
}
selected_filter = (
filter_state_for_rent
if self.listing_type == ListingType.FOR_RENT
else filter_state_for_sale
if self.listing_type == ListingType.FOR_SALE
else filter_state_sold
)
payload = {
"searchQueryState": {
"pagination": {},
"isMapVisible": True,
"mapBounds": {
"west": coords[0],
"east": coords[1],
"south": coords[2],
"north": coords[3],
},
"filterState": selected_filter,
"isListVisible": True,
"mapZoom": 11,
},
"wants": {"cat1": ["mapResults"]},
"isDebugRequest": False,
}
resp = self.session.put(
url, headers=self._get_headers(), json=payload
)
resp.raise_for_status()
a = resp.json()
return self._parse_properties(resp.json())
def _parse_properties(self, property_data: dict):
mapresults = property_data["cat1"]["searchResults"]["mapResults"]
properties_list = []
for result in mapresults:
if "hdpData" in result:
home_info = result["hdpData"]["homeInfo"]
address_data = {
"street_address": parse_address_two(home_info["streetAddress"])[0],
"unit": parse_unit(home_info["unit"])
if "unit" in home_info
else None,
"city": home_info["city"],
"state": home_info["state"],
"zip_code": home_info["zipcode"],
"country": home_info["country"],
}
property_data = {
"site_name": self.site_name,
"address": Address(**address_data),
"property_url": f"https://www.zillow.com{result['detailUrl']}",
"beds": int(home_info["bedrooms"])
if "bedrooms" in home_info
else None,
"baths": home_info.get("bathrooms"),
"square_feet": int(home_info["livingArea"])
if "livingArea" in home_info
else None,
"currency": home_info["currency"],
"price": home_info.get("price"),
"tax_assessed_value": int(home_info["taxAssessedValue"])
if "taxAssessedValue" in home_info
else None,
"property_type": PropertyType(home_info["homeType"]),
"listing_type": ListingType(
home_info["statusType"]
if "statusType" in home_info
else self.listing_type
),
"lot_area_value": round(home_info["lotAreaValue"], 2)
if "lotAreaValue" in home_info
else None,
"lot_area_unit": home_info.get("lotAreaUnit"),
"latitude": result["latLong"]["latitude"],
"longitude": result["latLong"]["longitude"],
"status_text": result.get("statusText"),
"posted_time": result["variableData"]["text"]
if "variableData" in result
and "text" in result["variableData"]
and result["variableData"]["type"] == "TIME_ON_INFO"
else None,
"img_src": result.get("imgSrc"),
"price_per_sqft": int(home_info["price"] // home_info["livingArea"])
if "livingArea" in home_info
and home_info["livingArea"] != 0
and "price" in home_info
else None,
}
property_obj = Property(**property_data)
properties_list.append(property_obj)
elif "isBuilding" in result:
price = result["price"]
building_data = {
"property_url": f"https://www.zillow.com{result['detailUrl']}",
"site_name": self.site_name,
"property_type": PropertyType("BUILDING"),
"listing_type": ListingType(result["statusType"]),
"img_src": result["imgSrc"],
"price": int(price.replace("From $", "").replace(",", ""))
if "From $" in price
else None,
"apt_min_price": int(
price.replace("$", "").replace(",", "").replace("+/mo", "")
)
if "+/mo" in price
else None,
"address": self._extract_address(result["address"]),
"bldg_min_beds": result["minBeds"],
"currency": "USD",
"bldg_min_baths": result["minBaths"],
"bldg_min_area": result.get("minArea"),
"bldg_unit_count": result["unitCount"],
"bldg_name": result.get("communityName"),
"status_text": result["statusText"],
"latitude": result["latLong"]["latitude"],
"longitude": result["latLong"]["longitude"],
}
building_obj = Property(**building_data)
properties_list.append(building_obj)
return properties_list
def _get_single_property_page(self, property_data: dict):
"""
This method is used when a user enters the exact location & zillow returns just one property
"""
url = (
f"https://www.zillow.com{property_data['hdpUrl']}"
if "zillow.com" not in property_data["hdpUrl"]
else property_data["hdpUrl"]
)
address_data = property_data["address"]
street_address, unit = parse_address_two(address_data["streetAddress"])
address = Address(
street_address=street_address,
unit=unit,
city=address_data["city"],
state=address_data["state"],
zip_code=address_data["zipcode"],
country=property_data.get("country"),
)
property_type = property_data.get("homeType", None)
return Property(
site_name=self.site_name,
address=address,
property_url=url,
beds=property_data.get("bedrooms", None),
baths=property_data.get("bathrooms", None),
year_built=property_data.get("yearBuilt", None),
price=property_data.get("price", None),
tax_assessed_value=property_data.get("taxAssessedValue", None),
latitude=property_data.get("latitude"),
longitude=property_data.get("longitude"),
img_src=property_data.get("streetViewTileImageUrlMediumAddress"),
currency=property_data.get("currency", None),
lot_area_value=property_data.get("lotAreaValue"),
lot_area_unit=property_data["lotAreaUnits"].lower()
if "lotAreaUnits" in property_data
else None,
agent_name=property_data.get("attributionInfo", {}).get("agentName", None),
stories=property_data.get("resoFacts", {}).get("stories", None),
description=property_data.get("description", None),
mls_id=property_data.get("attributionInfo", {}).get("mlsId", None),
price_per_sqft=property_data.get("resoFacts", {}).get(
"pricePerSquareFoot", None
),
square_feet=property_data.get("livingArea", None),
property_type=PropertyType(property_type),
listing_type=self.listing_type,
)
def _extract_address(self, address_str):
"""
Extract address components from a string formatted like '555 Wedglea Dr, Dallas, TX',
and return an Address object.
"""
parts = address_str.split(", ")
if len(parts) != 3:
raise ValueError(f"Unexpected address format: {address_str}")
street_address = parts[0].strip()
city = parts[1].strip()
state_zip = parts[2].split(" ")
if len(state_zip) == 1:
state = state_zip[0].strip()
zip_code = None
elif len(state_zip) == 2:
state = state_zip[0].strip()
zip_code = state_zip[1].strip()
else:
raise ValueError(f"Unexpected state/zip format in address: {address_str}")
street_address, unit = parse_address_two(street_address)
return Address(
street_address=street_address,
city=city,
unit=unit,
state=state,
zip_code=zip_code,
country="USA",
)
@staticmethod
def _get_headers():
return {
"authority": "www.zillow.com",
"accept": "*/*",
"accept-language": "en-US,en;q=0.9",
"content-type": "application/json",
"cookie": 'zjs_user_id=null; zg_anonymous_id=%220976ab81-2950-4013-98f0-108b15a554d2%22; zguid=24|%246b1bc625-3955-4d1e-a723-e59602e4ed08; g_state={"i_p":1693611172520,"i_l":1}; zgsession=1|d48820e2-1659-4d2f-b7d2-99a8127dd4f3; zjs_anonymous_id=%226b1bc625-3955-4d1e-a723-e59602e4ed08%22; JSESSIONID=82E8274D3DC8AF3AB9C8E613B38CF861; search=6|1697585860120%7Crb%3DDallas%252C-TX%26rect%3D33.016646%252C-96.555516%252C32.618763%252C-96.999347%26disp%3Dmap%26mdm%3Dauto%26sort%3Ddays%26listPriceActive%3D1%26fs%3D1%26fr%3D0%26mmm%3D0%26rs%3D0%26ah%3D0%26singlestory%3D0%26abo%3D0%26garage%3D0%26pool%3D0%26ac%3D0%26waterfront%3D0%26finished%3D0%26unfinished%3D0%26cityview%3D0%26mountainview%3D0%26parkview%3D0%26waterview%3D0%26hoadata%3D1%263dhome%3D0%26commuteMode%3Ddriving%26commuteTimeOfDay%3Dnow%09%0938128%09%7B%22isList%22%3Atrue%2C%22isMap%22%3Atrue%7D%09%09%09%09%09; AWSALB=gAlFj5Ngnd4bWP8k7CME/+YlTtX9bHK4yEkdPHa3VhL6K523oGyysFxBEpE1HNuuyL+GaRPvt2i/CSseAb+zEPpO4SNjnbLAJzJOOO01ipnWN3ZgPaa5qdv+fAki; AWSALBCORS=gAlFj5Ngnd4bWP8k7CME/+YlTtX9bHK4yEkdPHa3VhL6K523oGyysFxBEpE1HNuuyL+GaRPvt2i/CSseAb+zEPpO4SNjnbLAJzJOOO01ipnWN3ZgPaa5qdv+fAki; search=6|1697587741808%7Crect%3D33.37188814545521%2C-96.34484483007813%2C32.260490641365685%2C-97.21001816992188%26disp%3Dmap%26mdm%3Dauto%26p%3D1%26sort%3Ddays%26z%3D1%26listPriceActive%3D1%26fs%3D1%26fr%3D0%26mmm%3D0%26rs%3D0%26ah%3D0%26singlestory%3D0%26housing-connector%3D0%26abo%3D0%26garage%3D0%26pool%3D0%26ac%3D0%26waterfront%3D0%26finished%3D0%26unfinished%3D0%26cityview%3D0%26mountainview%3D0%26parkview%3D0%26waterview%3D0%26hoadata%3D1%26zillow-owned%3D0%263dhome%3D0%26featuredMultiFamilyBuilding%3D0%26commuteMode%3Ddriving%26commuteTimeOfDay%3Dnow%09%09%09%7B%22isList%22%3Atrue%2C%22isMap%22%3Atrue%7D%09%09%09%09%09',
"origin": "https://www.zillow.com",
"referer": "https://www.zillow.com",
"sec-ch-ua": '"Chromium";v="116", "Not)A;Brand";v="24", "Google Chrome";v="116"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
}

View File

@@ -1,14 +1,14 @@
class InvalidSite(Exception):
"""Raised when a provided site is does not exist."""
class InvalidListingType(Exception): class InvalidListingType(Exception):
"""Raised when a provided listing type is does not exist.""" """Raised when a provided listing type is does not exist."""
class NoResultsFound(Exception): class InvalidDate(Exception):
"""Raised when no results are found for the given location""" """Raised when only one of date_from or date_to is provided or not in the correct format. ex: 2023-10-23"""
class GeoCoordsNotFound(Exception): class AuthenticationError(Exception):
"""Raised when no property is found for the given address""" """Raised when there is an issue with the authentication process."""
def __init__(self, *args, response):
super().__init__(*args)
self.response = response

View File

@@ -1,48 +1,181 @@
import re from __future__ import annotations
import pandas as pd
from datetime import datetime
from .core.scrapers.models import Property, ListingType, Advertisers
from .exceptions import InvalidListingType, InvalidDate
ordered_properties = [
"property_url",
"property_id",
"listing_id",
"permalink",
"mls",
"mls_id",
"status",
"mls_status",
"text",
"style",
"formatted_address",
"full_street_line",
"street",
"unit",
"city",
"state",
"zip_code",
"beds",
"full_baths",
"half_baths",
"sqft",
"year_built",
"days_on_mls",
"list_price",
"list_price_min",
"list_price_max",
"list_date",
"pending_date",
"sold_price",
"last_sold_date",
"last_sold_price",
"assessed_value",
"estimated_value",
"tax",
"tax_history",
"new_construction",
"lot_sqft",
"price_per_sqft",
"latitude",
"longitude",
"neighborhoods",
"county",
"fips_code",
"stories",
"hoa_fee",
"parking_garage",
"agent_id",
"agent_name",
"agent_email",
"agent_phones",
"agent_mls_set",
"agent_nrds_id",
"broker_id",
"broker_name",
"builder_id",
"builder_name",
"office_id",
"office_mls_set",
"office_name",
"office_email",
"office_phones",
"nearby_schools",
"primary_photo",
"alt_photos"
]
def parse_address_two(street_address: str) -> tuple: def process_result(result: Property) -> pd.DataFrame:
if not street_address: prop_data = {prop: None for prop in ordered_properties}
return street_address, None prop_data.update(result.model_dump())
apt_match = re.search( if "address" in prop_data and prop_data["address"]:
r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+|SUITE\s*[\dA-Z]+)$", address_data = prop_data["address"]
street_address, prop_data["full_street_line"] = address_data.get("full_line")
re.I, prop_data["street"] = address_data.get("street")
prop_data["unit"] = address_data.get("unit")
prop_data["city"] = address_data.get("city")
prop_data["state"] = address_data.get("state")
prop_data["zip_code"] = address_data.get("zip")
prop_data["formatted_address"] = address_data.get("formatted_address")
if "advertisers" in prop_data and prop_data.get("advertisers"):
advertiser_data = prop_data["advertisers"]
if advertiser_data.get("agent"):
agent_data = advertiser_data["agent"]
prop_data["agent_id"] = agent_data.get("uuid")
prop_data["agent_name"] = agent_data.get("name")
prop_data["agent_email"] = agent_data.get("email")
prop_data["agent_phones"] = agent_data.get("phones")
prop_data["agent_mls_set"] = agent_data.get("mls_set")
prop_data["agent_nrds_id"] = agent_data.get("nrds_id")
if advertiser_data.get("broker"):
broker_data = advertiser_data["broker"]
prop_data["broker_id"] = broker_data.get("uuid")
prop_data["broker_name"] = broker_data.get("name")
if advertiser_data.get("builder"):
builder_data = advertiser_data["builder"]
prop_data["builder_id"] = builder_data.get("uuid")
prop_data["builder_name"] = builder_data.get("name")
if advertiser_data.get("office"):
office_data = advertiser_data["office"]
prop_data["office_id"] = office_data.get("uuid")
prop_data["office_name"] = office_data.get("name")
prop_data["office_email"] = office_data.get("email")
prop_data["office_phones"] = office_data.get("phones")
prop_data["office_mls_set"] = office_data.get("mls_set")
prop_data["price_per_sqft"] = prop_data["prc_sqft"]
prop_data["nearby_schools"] = filter(None, prop_data["nearby_schools"]) if prop_data["nearby_schools"] else None
prop_data["nearby_schools"] = ", ".join(set(prop_data["nearby_schools"])) if prop_data["nearby_schools"] else None
# Convert datetime objects to strings for CSV
for date_field in ["list_date", "pending_date", "last_sold_date"]:
if prop_data.get(date_field):
prop_data[date_field] = prop_data[date_field].strftime("%Y-%m-%d") if hasattr(prop_data[date_field], 'strftime') else prop_data[date_field]
# Convert HttpUrl objects to strings for CSV
if prop_data.get("property_url"):
prop_data["property_url"] = str(prop_data["property_url"])
description = result.description
if description:
prop_data["primary_photo"] = str(description.primary_photo) if description.primary_photo else None
prop_data["alt_photos"] = ", ".join(str(url) for url in description.alt_photos) if description.alt_photos else None
prop_data["style"] = (
description.style
if isinstance(description.style, str)
else description.style.value if description.style else None
) )
prop_data["beds"] = description.beds
prop_data["full_baths"] = description.baths_full
prop_data["half_baths"] = description.baths_half
prop_data["sqft"] = description.sqft
prop_data["lot_sqft"] = description.lot_sqft
prop_data["sold_price"] = description.sold_price
prop_data["year_built"] = description.year_built
prop_data["parking_garage"] = description.garage
prop_data["stories"] = description.stories
prop_data["text"] = description.text
if apt_match: properties_df = pd.DataFrame([prop_data])
apt_str = apt_match.group().strip() properties_df = properties_df.reindex(columns=ordered_properties)
cleaned_apt_str = re.sub(
r"(APT\s*|UNIT\s*|LOT\s*|SUITE\s*)", "#", apt_str, flags=re.I
)
main_address = street_address.replace(apt_str, "").strip() return properties_df[ordered_properties]
return main_address, cleaned_apt_str
else:
return street_address, None
def parse_unit(street_address: str): def validate_input(listing_type: str) -> None:
if not street_address: if listing_type.upper() not in ListingType.__members__:
return None raise InvalidListingType(f"Provided listing type, '{listing_type}', does not exist.")
apt_match = re.search(
r"(APT\s*[\dA-Z]+|#[\dA-Z]+|UNIT\s*[\dA-Z]+|LOT\s*[\dA-Z]+)$",
street_address,
re.I,
)
if apt_match:
apt_str = apt_match.group().strip()
apt_str = re.sub(r"(APT\s*|UNIT\s*|LOT\s*)", "#", apt_str, flags=re.I)
return apt_str
else:
return None
if __name__ == "__main__": def validate_dates(date_from: str | None, date_to: str | None) -> None:
print(parse_address_two("4303 E Cactus Rd Apt 126")) if isinstance(date_from, str) != isinstance(date_to, str):
print(parse_address_two("1234 Elm Street apt 2B")) raise InvalidDate("Both date_from and date_to must be provided.")
print(parse_address_two("1234 Elm Street UNIT 3A"))
print(parse_address_two("1234 Elm Street unit 3A")) if date_from and date_to:
print(parse_address_two("1234 Elm Street SuIte 3A")) try:
date_from_obj = datetime.strptime(date_from, "%Y-%m-%d")
date_to_obj = datetime.strptime(date_to, "%Y-%m-%d")
if date_to_obj < date_from_obj:
raise InvalidDate("date_to must be after date_from.")
except ValueError:
raise InvalidDate(f"Invalid date format or range")
def validate_limit(limit: int) -> None:
#: 1 -> 10000 limit
if limit is not None and (limit < 1 or limit > 10000):
raise ValueError("Property limit must be between 1 and 10,000.")

1012
poetry.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -1,23 +1,25 @@
[tool.poetry] [tool.poetry]
name = "homeharvest" name = "homeharvest"
version = "0.2.4" version = "0.5.0"
description = "Real estate scraping library supporting Zillow, Realtor.com & Redfin." description = "Real estate scraping library"
authors = ["Zachary Hampton <zachary@zacharysproducts.com>", "Cullen Watson <cullen@cullen.ai>"] authors = ["Zachary Hampton <zachary@bunsly.com>", "Cullen Watson <cullen@bunsly.com>"]
homepage = "https://github.com/ZacharyHampton/HomeHarvest" homepage = "https://github.com/Bunsly/HomeHarvest"
readme = "README.md" readme = "README.md"
[tool.poetry.scripts] [tool.poetry.scripts]
homeharvest = "homeharvest.cli:main" homeharvest = "homeharvest.cli:main"
[tool.poetry.dependencies] [tool.poetry.dependencies]
python = "^3.10" python = ">=3.9,<3.13"
requests = "^2.31.0" requests = "^2.32.4"
pandas = "^2.1.0" pandas = "^2.3.1"
openpyxl = "^3.1.2" pydantic = "^2.11.7"
tenacity = "^9.1.2"
[tool.poetry.group.dev.dependencies] [tool.poetry.group.dev.dependencies]
pytest = "^7.4.2" pytest = "^7.4.2"
pre-commit = "^3.7.0"
[build-system] [build-system]
requires = ["poetry-core"] requires = ["poetry-core"]

View File

@@ -1,40 +1,375 @@
from homeharvest import scrape_property from homeharvest import scrape_property, Property
from homeharvest.exceptions import ( import pandas as pd
InvalidSite,
InvalidListingType,
NoResultsFound, def test_realtor_pending_or_contingent():
GeoCoordsNotFound, pending_or_contingent_result = scrape_property(location="Surprise, AZ", listing_type="pending")
)
regular_result = scrape_property(location="Surprise, AZ", listing_type="for_sale", exclude_pending=True)
assert all([result is not None for result in [pending_or_contingent_result, regular_result]])
assert len(pending_or_contingent_result) != len(regular_result)
def test_realtor_pending_comps():
pending_comps = scrape_property(
location="2530 Al Lipscomb Way",
radius=5,
past_days=180,
listing_type="pending",
)
for_sale_comps = scrape_property(
location="2530 Al Lipscomb Way",
radius=5,
past_days=180,
listing_type="for_sale",
)
sold_comps = scrape_property(
location="2530 Al Lipscomb Way",
radius=5,
past_days=180,
listing_type="sold",
)
results = [pending_comps, for_sale_comps, sold_comps]
assert all([result is not None for result in results])
#: assert all lengths are different
assert len(set([len(result) for result in results])) == len(results)
def test_realtor_sold_past():
result = scrape_property(
location="San Diego, CA",
past_days=30,
listing_type="sold",
)
assert result is not None and len(result) > 0
def test_realtor_comps():
result = scrape_property(
location="2530 Al Lipscomb Way",
radius=0.5,
past_days=180,
listing_type="sold",
)
assert result is not None and len(result) > 0
def test_realtor_last_x_days_sold():
days_result_30 = scrape_property(location="Dallas, TX", listing_type="sold", past_days=30)
days_result_10 = scrape_property(location="Dallas, TX", listing_type="sold", past_days=10)
assert all([result is not None for result in [days_result_30, days_result_10]]) and len(days_result_30) != len(
days_result_10
)
def test_realtor_date_range_sold():
days_result_30 = scrape_property(
location="Dallas, TX", listing_type="sold", date_from="2023-05-01", date_to="2023-05-28"
)
days_result_60 = scrape_property(
location="Dallas, TX", listing_type="sold", date_from="2023-04-01", date_to="2023-06-10"
)
assert all([result is not None for result in [days_result_30, days_result_60]]) and len(days_result_30) < len(
days_result_60
)
def test_realtor_single_property():
results = [
scrape_property(
location="15509 N 172nd Dr, Surprise, AZ 85388",
listing_type="for_sale",
),
scrape_property(
location="2530 Al Lipscomb Way",
listing_type="for_sale",
),
]
assert all([result is not None for result in results])
def test_realtor(): def test_realtor():
results = [ results = [
scrape_property( scrape_property(
location="2530 Al Lipscomb Way", location="2530 Al Lipscomb Way",
site_name="realtor.com",
listing_type="for_sale", listing_type="for_sale",
), ),
scrape_property( scrape_property(
location="Phoenix, AZ", site_name=["realtor.com"], listing_type="for_rent" location="Phoenix, AZ", listing_type="for_rent", limit=1000
), #: does not support "city, state, USA" format ), #: does not support "city, state, USA" format
scrape_property( scrape_property(
location="Dallas, TX", site_name="realtor.com", listing_type="sold" location="Dallas, TX", listing_type="sold", limit=1000
), #: does not support "city, state, USA" format ), #: does not support "city, state, USA" format
scrape_property(location="85281", site_name="realtor.com"), scrape_property(location="85281"),
] ]
assert all([result is not None for result in results]) assert all([result is not None for result in results])
bad_results = []
try: def test_realtor_city():
bad_results += [ results = scrape_property(location="Atlanta, GA", listing_type="for_sale", limit=1000)
scrape_property(
assert results is not None and len(results) > 0
def test_realtor_land():
results = scrape_property(location="Atlanta, GA", listing_type="for_sale", property_type=["land"], limit=1000)
assert results is not None and len(results) > 0
def test_realtor_bad_address():
bad_results = scrape_property(
location="abceefg ju098ot498hh9", location="abceefg ju098ot498hh9",
site_name="realtor.com",
listing_type="for_sale", listing_type="for_sale",
) )
]
except (InvalidSite, InvalidListingType, NoResultsFound, GeoCoordsNotFound): if len(bad_results) == 0:
assert True assert True
assert all([result is None for result in bad_results])
def test_realtor_foreclosed():
foreclosed = scrape_property(location="Dallas, TX", listing_type="for_sale", past_days=100, foreclosure=True)
not_foreclosed = scrape_property(location="Dallas, TX", listing_type="for_sale", past_days=100, foreclosure=False)
assert len(foreclosed) != len(not_foreclosed)
def test_realtor_agent():
scraped = scrape_property(location="Detroit, MI", listing_type="for_sale", limit=1000, extra_property_data=False)
assert scraped["agent_name"].nunique() > 1
def test_realtor_without_extra_details():
results = [
scrape_property(
location="00741",
listing_type="sold",
limit=10,
extra_property_data=False,
),
scrape_property(
location="00741",
listing_type="sold",
limit=10,
extra_property_data=True,
),
]
assert not results[0].equals(results[1])
def test_pr_zip_code():
results = scrape_property(
location="00741",
listing_type="for_sale",
)
assert results is not None and len(results) > 0
def test_exclude_pending():
results = scrape_property(
location="33567",
listing_type="pending",
exclude_pending=True,
)
assert results is not None and len(results) > 0
def test_style_value_error():
results = scrape_property(
location="Alaska, AK",
listing_type="sold",
extra_property_data=False,
limit=1000,
)
assert results is not None and len(results) > 0
def test_primary_image_error():
results = scrape_property(
location="Spokane, PA",
listing_type="for_rent", # or (for_sale, for_rent, pending)
past_days=360,
radius=3,
extra_property_data=False,
)
assert results is not None and len(results) > 0
def test_limit():
over_limit = 876
extra_params = {"limit": over_limit}
over_results = scrape_property(
location="Waddell, AZ",
listing_type="for_sale",
**extra_params,
)
assert over_results is not None and len(over_results) <= over_limit
under_limit = 1
under_results = scrape_property(
location="Waddell, AZ",
listing_type="for_sale",
limit=under_limit,
)
assert under_results is not None and len(under_results) == under_limit
def test_apartment_list_price():
results = scrape_property(
location="Spokane, WA",
listing_type="for_rent", # or (for_sale, for_rent, pending)
extra_property_data=False,
)
assert results is not None
results = results[results["style"] == "APARTMENT"]
#: get percentage of results with atleast 1 of any column not none, list_price, list_price_min, list_price_max
assert (
len(results[results[["list_price", "list_price_min", "list_price_max"]].notnull().any(axis=1)]) / len(results)
> 0.5
)
def test_phone_number_matching():
searches = [
scrape_property(
location="Phoenix, AZ",
listing_type="for_sale",
limit=100,
),
scrape_property(
location="Phoenix, AZ",
listing_type="for_sale",
limit=100,
),
]
assert all([search is not None for search in searches])
#: random row
row = searches[0][searches[0]["agent_phones"].notnull()].sample()
#: find matching row
matching_row = searches[1].loc[searches[1]["property_url"] == row["property_url"].values[0]]
#: assert phone numbers are the same
assert row["agent_phones"].values[0] == matching_row["agent_phones"].values[0]
def test_return_type():
results = {
"pandas": [scrape_property(location="Surprise, AZ", listing_type="for_rent", limit=100)],
"pydantic": [scrape_property(location="Surprise, AZ", listing_type="for_rent", limit=100, return_type="pydantic")],
"raw": [
scrape_property(location="Surprise, AZ", listing_type="for_rent", limit=100, return_type="raw"),
scrape_property(location="66642", listing_type="for_rent", limit=100, return_type="raw"),
],
}
assert all(isinstance(result, pd.DataFrame) for result in results["pandas"])
assert all(isinstance(result[0], Property) for result in results["pydantic"])
assert all(isinstance(result[0], dict) for result in results["raw"])
def test_has_open_house():
address_result = scrape_property("1 Hawthorne St Unit 12F, San Francisco, CA 94105", return_type="raw")
assert address_result[0]["open_houses"] is not None #: has open house data from address search
zip_code_result = scrape_property("94105", return_type="raw")
address_from_zip_result = list(filter(lambda row: row["property_id"] == '1264014746', zip_code_result))
assert address_from_zip_result[0]["open_houses"] is not None #: has open house data from general search
def test_return_type_consistency():
"""Test that return_type works consistently between general and address searches"""
# Test configurations - different search types
test_locations = [
("Dallas, TX", "general"), # General city search
("75201", "zip"), # ZIP code search
("2530 Al Lipscomb Way", "address") # Address search
]
for location, search_type in test_locations:
# Test all return types for each search type
pandas_result = scrape_property(
location=location,
listing_type="for_sale",
limit=3,
return_type="pandas"
)
pydantic_result = scrape_property(
location=location,
listing_type="for_sale",
limit=3,
return_type="pydantic"
)
raw_result = scrape_property(
location=location,
listing_type="for_sale",
limit=3,
return_type="raw"
)
# Validate pandas return type
assert isinstance(pandas_result, pd.DataFrame), f"pandas result should be DataFrame for {search_type}"
assert len(pandas_result) > 0, f"pandas result should not be empty for {search_type}"
required_columns = ["property_id", "property_url", "list_price", "status", "formatted_address"]
for col in required_columns:
assert col in pandas_result.columns, f"Missing column {col} in pandas result for {search_type}"
# Validate pydantic return type
assert isinstance(pydantic_result, list), f"pydantic result should be list for {search_type}"
assert len(pydantic_result) > 0, f"pydantic result should not be empty for {search_type}"
for item in pydantic_result:
assert isinstance(item, Property), f"pydantic items should be Property objects for {search_type}"
assert item.property_id is not None, f"property_id should not be None for {search_type}"
# Validate raw return type
assert isinstance(raw_result, list), f"raw result should be list for {search_type}"
assert len(raw_result) > 0, f"raw result should not be empty for {search_type}"
for item in raw_result:
assert isinstance(item, dict), f"raw items should be dict for {search_type}"
assert "property_id" in item, f"raw items should have property_id for {search_type}"
assert "href" in item, f"raw items should have href for {search_type}"
# Cross-validate that different return types return related data
pandas_ids = set(pandas_result["property_id"].tolist())
pydantic_ids = set(prop.property_id for prop in pydantic_result)
raw_ids = set(item["property_id"] for item in raw_result)
# All return types should have some properties
assert len(pandas_ids) > 0, f"pandas should return properties for {search_type}"
assert len(pydantic_ids) > 0, f"pydantic should return properties for {search_type}"
assert len(raw_ids) > 0, f"raw should return properties for {search_type}"

View File

@@ -1,38 +0,0 @@
from homeharvest import scrape_property
from homeharvest.exceptions import (
InvalidSite,
InvalidListingType,
NoResultsFound,
GeoCoordsNotFound,
)
def test_redfin():
results = [
scrape_property(
location="2530 Al Lipscomb Way", site_name="redfin", listing_type="for_sale"
),
scrape_property(
location="Phoenix, AZ, USA", site_name=["redfin"], listing_type="for_rent"
),
scrape_property(
location="Dallas, TX, USA", site_name="redfin", listing_type="sold"
),
scrape_property(location="85281", site_name="redfin"),
]
assert all([result is not None for result in results])
bad_results = []
try:
bad_results += [
scrape_property(
location="abceefg ju098ot498hh9",
site_name="redfin",
listing_type="for_sale",
)
]
except (InvalidSite, InvalidListingType, NoResultsFound, GeoCoordsNotFound):
assert True
assert all([result is None for result in bad_results])

View File

@@ -1,38 +0,0 @@
from homeharvest import scrape_property
from homeharvest.exceptions import (
InvalidSite,
InvalidListingType,
NoResultsFound,
GeoCoordsNotFound,
)
def test_zillow():
results = [
scrape_property(
location="2530 Al Lipscomb Way", site_name="zillow", listing_type="for_sale"
),
scrape_property(
location="Phoenix, AZ, USA", site_name=["zillow"], listing_type="for_rent"
),
scrape_property(
location="Dallas, TX, USA", site_name="zillow", listing_type="sold"
),
scrape_property(location="85281", site_name="zillow"),
]
assert all([result is not None for result in results])
bad_results = []
try:
bad_results += [
scrape_property(
location="abceefg ju098ot498hh9",
site_name="zillow",
listing_type="for_sale",
)
]
except (InvalidSite, InvalidListingType, NoResultsFound, GeoCoordsNotFound):
assert True
assert all([result is None for result in bad_results])