Compare commits

...

5 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Zachary Hampton | 4cef926d7d | Merge pull request #14 from ZacharyHampton/keep_duplicates_flag (Keep duplicates flag) | 2023-09-20 20:27:08 -07:00 |
| Cullen Watson | e82eeaa59f | docs: add keep duplicates flag | 2023-09-20 20:25:50 -05:00 |
| Cullen Watson | 644f16b25b | feat: keep duplicates flag | 2023-09-20 20:24:18 -05:00 |
| Cullen Watson | e9ddc6df92 | docs: update tutorial vid for release v0.2.7 | 2023-09-19 22:18:49 -05:00 |
| Cullen Watson | 50fb1c391d | docs: update property schema | 2023-09-19 21:35:37 -05:00 |
4 changed files with 38 additions and 34 deletions


@@ -10,7 +10,7 @@
 - Scrapes properties from **Zillow**, **Realtor.com** & **Redfin** simultaneously
 - Aggregates the properties in a Pandas DataFrame
-[Video Guide for HomeHarvest](https://www.youtube.com/watch?v=HCoHoiJdWQY)
+[Video Guide for HomeHarvest](https://youtu.be/JnV7eR2Ve2o) - _updated for release v0.2.7_
 ![homeharvest](https://github.com/ZacharyHampton/HomeHarvest/assets/78247585/b3d5d727-e67b-4a9f-85d8-1e65fd18620a)
@@ -37,6 +37,7 @@ By default:
 - The `-o` or `--output` default format is `excel`. Options are `csv` or `excel`.
 - If `-f` or `--filename` is left blank, the default is `HomeHarvest_<current_timestamp>`.
 - If `-p` or `--proxy` is not provided, the scraper uses the local IP.
+- Use `-k` or `--keep_duplicates` to keep duplicate properties based on address. If not provided, duplicates will be removed.
 ### Python
 ```py
@@ -71,8 +72,9 @@ Required
 ├── location (str): address in various formats e.g. just zip, full address, city/state, etc.
 └── listing_type (enum): for_rent, for_sale, sold
 Optional
-├── site_name (List[enum], default=all three sites): zillow, realtor.com, redfin
+├── site_name (list[enum], default=all three sites): zillow, realtor.com, redfin
 ├── proxy (str): in format 'http://user:pass@host:port' or [https, socks]
+└── keep_duplicates (bool, default=False): whether to keep or remove duplicate properties based on address
 ```
 ### Property Schema
@@ -81,7 +83,7 @@ Property
 ├── Basic Information:
 │ ├── property_url (str)
 │ ├── site_name (enum): zillow, redfin, realtor.com
-│ ├── listing_type (enum: ListingType)
+│ ├── listing_type (enum): for_sale, for_rent, sold
 │ └── property_type (enum): house, apartment, condo, townhouse, single_family, multi_family, building
 ├── Address Details:
@@ -92,45 +94,38 @@ Property
 │ ├── unit (str)
 │ └── country (str)
-├── Property Features:
-│ ├── price (int)
+├── House for Sale Features:
 │ ├── tax_assessed_value (int)
-│ ├── currency (str)
-│ ├── square_feet (int)
-│ ├── beds (int)
-│ ├── baths (float)
 │ ├── lot_area_value (float)
 │ ├── lot_area_unit (str)
 │ ├── stories (int)
-│ └── year_built (int)
+│ ├── year_built (int)
+│ └── price_per_sqft (int)
+├── Building for Sale and Apartment Details:
+│ ├── bldg_name (str)
+│ ├── beds_min (int)
+│ ├── beds_max (int)
+│ ├── baths_min (float)
+│ ├── baths_max (float)
+│ ├── sqft_min (int)
+│ ├── sqft_max (int)
+│ ├── price_min (int)
+│ ├── price_max (int)
+│ ├── area_min (int)
+│ └── unit_count (int)
 ├── Miscellaneous Details:
-│ ├── price_per_sqft (int)
 │ ├── mls_id (str)
 │ ├── agent_name (str)
 │ ├── img_src (str)
 │ ├── description (str)
 │ ├── status_text (str)
-│ ├── latitude (float)
-│ ├── longitude (float)
-│ └── posted_time (str) [Only for Zillow]
-├── Building Details (for property_type: building):
-│ ├── bldg_name (str)
-│ ├── bldg_unit_count (int)
-│ ├── bldg_min_beds (int)
-│ ├── bldg_min_baths (float)
-│ └── bldg_min_area (int)
-└── Apartment Details (for property type: apartment):
-    ├── apt_min_beds: int
-    ├── apt_max_beds: int
-    ├── apt_min_baths: float
-    ├── apt_max_baths: float
-    ├── apt_min_price: int
-    ├── apt_max_price: int
-    ├── apt_min_sqft: int
-    ├── apt_max_sqft: int
+│ └── posted_time (str)
+└── Location Details:
+    ├── latitude (float)
+    └── longitude (float)
 ```
 ## Supported Countries for Property Scraping
@@ -144,7 +139,7 @@ The following exceptions may be raised when using HomeHarvest:
 - `InvalidSite` - valid options: `zillow`, `redfin`, `realtor.com`
 - `InvalidListingType` - valid options: `for_sale`, `for_rent`, `sold`
 - `NoResultsFound` - no properties found from your input
-- `GeoCoordsNotFound` - if Zillow scraper is not able to create geo-coordinates from the location you input
+- `GeoCoordsNotFound` - if Zillow scraper is not able to derive geo-coordinates from the location you input
 ## Frequently Asked Questions
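The reworked Property schema above maps naturally onto a typed record. A minimal sketch, assuming a hypothetical `Property` dataclass that mirrors a few of the documented fields — the class below is illustrative only, not the library's actual model:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ListingType(Enum):
    FOR_SALE = "for_sale"
    FOR_RENT = "for_rent"
    SOLD = "sold"


@dataclass
class Property:
    # Basic Information (subset of the documented schema)
    property_url: str
    site_name: str
    listing_type: ListingType
    # Location Details are optional: not every site returns coordinates
    latitude: Optional[float] = None
    longitude: Optional[float] = None


p = Property(
    property_url="https://example.com/listing/1",
    site_name="zillow",
    listing_type=ListingType.FOR_SALE,
)
print(p.listing_type.value)  # → for_sale
```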


@@ -119,6 +119,7 @@ def scrape_property(
     site_name: Union[str, list[str]] = None,
     listing_type: str = "for_sale",
     proxy: str = None,
+    keep_duplicates: bool = False
 ) -> pd.DataFrame:
     """
     Scrape property from various sites from a given location and listing type.
@@ -165,5 +166,6 @@ def scrape_property(
         if col not in final_df.columns:
             final_df[col] = None
-    final_df = final_df.drop_duplicates(subset=columns_to_track, keep="first")
+    if not keep_duplicates:
+        final_df = final_df.drop_duplicates(subset=columns_to_track, keep="first")
     return final_df


@@ -42,11 +42,18 @@ def main():
         help="Name of the output file (without extension)",
     )
+    parser.add_argument(
+        "-k",
+        "--keep_duplicates",
+        action="store_true",
+        help="Keep duplicate properties based on address"
+    )
     parser.add_argument("-p", "--proxy", type=str, default=None, help="Proxy to use for scraping")
     args = parser.parse_args()
-    result = scrape_property(args.location, args.site_name, args.listing_type, proxy=args.proxy)
+    result = scrape_property(args.location, args.site_name, args.listing_type, proxy=args.proxy, keep_duplicates=args.keep_duplicates)
     if not args.filename:
         timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
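Because `--keep_duplicates` uses `action="store_true"`, the attribute defaults to `False` and flips to `True` only when the flag is passed. A stand-alone sketch — the parser here is a simplified stand-in for the CLI's, with a hypothetical `location` positional as its only other argument:

```python
import argparse

# Simplified stand-in for the CLI parser; only the pieces needed to show the flag.
parser = argparse.ArgumentParser(prog="homeharvest")
parser.add_argument("location", type=str)
parser.add_argument(
    "-k",
    "--keep_duplicates",
    action="store_true",
    help="Keep duplicate properties based on address",
)

print(parser.parse_args(["85281", "-k"]).keep_duplicates)  # → True
print(parser.parse_args(["85281"]).keep_duplicates)        # → False
```

Both the short `-k` and long `--keep_duplicates` spellings set the same attribute, which is then forwarded to `scrape_property` as `keep_duplicates=args.keep_duplicates`.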


@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "homeharvest"
-version = "0.2.7"
+version = "0.2.8"
 description = "Real estate scraping library supporting Zillow, Realtor.com & Redfin."
 authors = ["Zachary Hampton <zachary@zacharysproducts.com>", "Cullen Watson <cullen@cullen.ai>"]
 homepage = "https://github.com/ZacharyHampton/HomeHarvest"