Compare commits

2 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Cullen | 78c1ec8e9f | [fix] add compensation | 2023-10-28 16:13:10 -05:00 |
| Cullen | a2dd93aca1 | [enh] use ziprecuriter api | 2023-10-28 15:50:28 -05:00 |

14 changed files with 265 additions and 493 deletions

@@ -4,15 +4,15 @@
**Not technical?** Try out the web scraping tool on our site at [usejobspy.com](https://usejobspy.com). **Not technical?** Try out the web scraping tool on our site at [usejobspy.com](https://usejobspy.com).
*Looking to build a data-focused software product?* **[Book a call](https://bunsly.com/)** *to *Looking to build a data-focused software product?* **[Book a call](https://calendly.com/bunsly/15min)** *to
work with us.* work with us.*
\
Check out another project we wrote: ***[HomeHarvest](https://github.com/Bunsly/HomeHarvest)** a Python package Check out another project we wrote: ***[HomeHarvest](https://github.com/Bunsly/HomeHarvest)** a Python package
for real estate scraping* for real estate scraping*
## Features ## Features
- Scrapes job postings from **LinkedIn**, **Indeed**, **Glassdoor**, & **ZipRecruiter** simultaneously - Scrapes job postings from **LinkedIn**, **Indeed** & **ZipRecruiter** simultaneously
- Aggregates the job postings in a Pandas DataFrame - Aggregates the job postings in a Pandas DataFrame
- Proxy support (HTTP/S, SOCKS) - Proxy support (HTTP/S, SOCKS)
@@ -35,15 +35,15 @@ _Python version >= [3.10](https://www.python.org/downloads/release/python-3100/)
from jobspy import scrape_jobs from jobspy import scrape_jobs
jobs = scrape_jobs( jobs = scrape_jobs(
site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"], site_name=["indeed", "linkedin", "zip_recruiter"],
search_term="software engineer", search_term="software engineer",
location="Dallas, TX", location="Dallas, TX",
results_wanted=10, results_wanted=10,
country_indeed='USA' # only needed for indeed / glassdoor country_indeed='USA' # only needed for indeed
) )
print(f"Found {len(jobs)} jobs") print(f"Found {len(jobs)} jobs")
print(jobs.head()) print(jobs.head())
jobs.to_csv("jobs.csv", index=False) # to_xlsx jobs.to_csv("jobs.csv", index=False) # / to_xlsx
``` ```
### Output ### Output
@@ -62,7 +62,7 @@ zip_recruiter Software Developer TEKsystems Phoenix
```plaintext ```plaintext
Required Required
├── site_type (List[enum]): linkedin, zip_recruiter, indeed, glassdoor ├── site_type (List[enum]): linkedin, zip_recruiter, indeed
└── search_term (str) └── search_term (str)
Optional Optional
├── location (int) ├── location (int)
@@ -107,46 +107,43 @@ The following exceptions may be raised when using JobSpy:
* `LinkedInException` * `LinkedInException`
* `IndeedException` * `IndeedException`
* `ZipRecruiterException` * `ZipRecruiterException`
* `GlassdoorException`
## Supported Countries for Job Searching ## Supported Countries for Job Searching
### **LinkedIn** ### **LinkedIn**
LinkedIn searches globally & uses only the `location` parameter. You can only fetch 1000 jobs max from the LinkedIn endpoint we're using LinkedIn searches globally & uses only the `location` parameter.
### **ZipRecruiter** ### **ZipRecruiter**
ZipRecruiter searches for jobs in **US/Canada** & uses only the `location` parameter. ZipRecruiter searches for jobs in **US/Canada** & uses only the `location` parameter.
### **Indeed / Glassdoor** ### **Indeed**
Indeed & Glassdoor supports most countries, but the `country_indeed` parameter is required. Additionally, use the `location` Indeed supports most countries, but the `country_indeed` parameter is required. Additionally, use the `location`
parameter to narrow down the location, e.g. city & state if necessary. parameter to narrow down the location, e.g. city & state if necessary.
You can specify the following countries when searching on Indeed (use the exact name, * indicates support for Glassdoor): You can specify the following countries when searching on Indeed (use the exact name):
| | | | | | | | | |
|----------------------|--------------|------------|----------------| |----------------------|--------------|------------|----------------|
| Argentina | Australia* | Austria* | Bahrain | | Argentina | Australia | Austria | Bahrain |
| Belgium* | Brazil* | Canada* | Chile | | Belgium | Brazil | Canada | Chile |
| China | Colombia | Costa Rica | Czech Republic | | China | Colombia | Costa Rica | Czech Republic |
| Denmark | Ecuador | Egypt | Finland | | Denmark | Ecuador | Egypt | Finland |
| France* | Germany* | Greece | Hong Kong* | | France | Germany | Greece | Hong Kong |
| Hungary | India* | Indonesia | Ireland* | | Hungary | India | Indonesia | Ireland |
| Israel | Italy* | Japan | Kuwait | | Israel | Italy | Japan | Kuwait |
| Luxembourg | Malaysia | Mexico* | Morocco | | Luxembourg | Malaysia | Mexico | Morocco |
| Netherlands* | New Zealand* | Nigeria | Norway | | Netherlands | New Zealand | Nigeria | Norway |
| Oman | Pakistan | Panama | Peru | | Oman | Pakistan | Panama | Peru |
| Philippines | Poland | Portugal | Qatar | | Philippines | Poland | Portugal | Qatar |
| Romania | Saudi Arabia | Singapore* | South Africa | | Romania | Saudi Arabia | Singapore | South Africa |
| South Korea | Spain* | Sweden | Switzerland* | | South Korea | Spain | Sweden | Switzerland |
| Taiwan | Thailand | Turkey | Ukraine | | Taiwan | Thailand | Turkey | Ukraine |
| United Arab Emirates | UK* | USA* | Uruguay | | United Arab Emirates | UK | USA | Uruguay |
| Venezuela | Vietnam | | | | Venezuela | Vietnam | | |
Glassdoor can only fetch 900 jobs from the endpoint we're using on a given search.
## Frequently Asked Questions ## Frequently Asked Questions
--- ---

@@ -1,7 +1,7 @@
[tool.poetry] [tool.poetry]
name = "python-jobspy" name = "python-jobspy"
version = "1.1.26" version = "1.1.16"
description = "Job scraper for LinkedIn, Indeed, Glassdoor & ZipRecruiter" description = "Job scraper for LinkedIn, Indeed & ZipRecruiter"
authors = ["Zachary Hampton <zachary@bunsly.com>", "Cullen Watson <cullen@bunsly.com>"] authors = ["Zachary Hampton <zachary@bunsly.com>", "Cullen Watson <cullen@bunsly.com>"]
homepage = "https://github.com/Bunsly/JobSpy" homepage = "https://github.com/Bunsly/JobSpy"
readme = "README.md" readme = "README.md"

@@ -6,21 +6,18 @@ from typing import Tuple, Optional
from .jobs import JobType, Location from .jobs import JobType, Location
from .scrapers.indeed import IndeedScraper from .scrapers.indeed import IndeedScraper
from .scrapers.ziprecruiter import ZipRecruiterScraper from .scrapers.ziprecruiter import ZipRecruiterScraper
from .scrapers.glassdoor import GlassdoorScraper
from .scrapers.linkedin import LinkedInScraper from .scrapers.linkedin import LinkedInScraper
from .scrapers import ScraperInput, Site, JobResponse, Country from .scrapers import ScraperInput, Site, JobResponse, Country
from .scrapers.exceptions import ( from .scrapers.exceptions import (
LinkedInException, LinkedInException,
IndeedException, IndeedException,
ZipRecruiterException, ZipRecruiterException,
GlassdoorException,
) )
SCRAPER_MAPPING = { SCRAPER_MAPPING = {
Site.LINKEDIN: LinkedInScraper, Site.LINKEDIN: LinkedInScraper,
Site.INDEED: IndeedScraper, Site.INDEED: IndeedScraper,
Site.ZIP_RECRUITER: ZipRecruiterScraper, Site.ZIP_RECRUITER: ZipRecruiterScraper,
Site.GLASSDOOR: GlassdoorScraper,
} }
@@ -93,8 +90,6 @@ def scrape_jobs(
raise IndeedException(str(e)) raise IndeedException(str(e))
if site == Site.ZIP_RECRUITER: if site == Site.ZIP_RECRUITER:
raise ZipRecruiterException(str(e)) raise ZipRecruiterException(str(e))
if site == Site.GLASSDOOR:
raise GlassdoorException(str(e))
else: else:
raise e raise e
return site.value, scraped_data return site.value, scraped_data
@@ -132,10 +127,7 @@ def scrape_jobs(
job_data["emails"] = ( job_data["emails"] = (
", ".join(job_data["emails"]) if job_data["emails"] else None ", ".join(job_data["emails"]) if job_data["emails"] else None
) )
if job_data["location"]: job_data["location"] = Location(**job_data["location"]).display_location()
job_data["location"] = Location(
**job_data["location"]
).display_location()
compensation_obj = job_data.get("compensation") compensation_obj = job_data.get("compensation")
if compensation_obj and isinstance(compensation_obj, dict): if compensation_obj and isinstance(compensation_obj, dict):
@@ -163,7 +155,6 @@ def scrape_jobs(
"site", "site",
"title", "title",
"company", "company",
"company_url",
"location", "location",
"job_type", "job_type",
"date_posted", "date_posted",
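
The `@@ -93,8 +90,6 @@` hunk above turns a generic error into a site-specific exception through a chain of `if` checks, one per site. The same mapping could be written as a dictionary lookup. This is only a sketch with stand-in classes: `SITE_EXCEPTIONS` and `wrap_error` are hypothetical names and are not part of JobSpy.

```python
from enum import Enum


class Site(Enum):
    LINKEDIN = "linkedin"
    INDEED = "indeed"
    ZIP_RECRUITER = "zip_recruiter"


class LinkedInException(Exception):
    def __init__(self, message=None):
        super().__init__(message or "An error occurred with LinkedIn")


class IndeedException(Exception):
    def __init__(self, message=None):
        super().__init__(message or "An error occurred with Indeed")


class ZipRecruiterException(Exception):
    def __init__(self, message=None):
        super().__init__(message or "An error occurred with ZipRecruiter")


# Hypothetical lookup table replacing the if-chain in the hunk.
SITE_EXCEPTIONS = {
    Site.LINKEDIN: LinkedInException,
    Site.INDEED: IndeedException,
    Site.ZIP_RECRUITER: ZipRecruiterException,
}


def wrap_error(site: Site, error: Exception) -> Exception:
    """Return the site-specific exception for `error`, or `error` itself
    when the site has no dedicated exception type."""
    exc_type = SITE_EXCEPTIONS.get(site)
    if exc_type is None:
        return error
    return exc_type(str(error))
```

A caller would then write `raise wrap_error(site, e) from e` inside the `except` block, so dropping a site (as this diff does for Glassdoor) only removes one dictionary entry.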

@@ -1,6 +1,7 @@
from typing import Union, Optional from typing import Union, Optional
from datetime import date from datetime import date
from enum import Enum from enum import Enum
from pydantic import BaseModel, validator from pydantic import BaseModel, validator
@@ -55,13 +56,13 @@ class JobType(Enum):
class Country(Enum): class Country(Enum):
ARGENTINA = ("argentina", "com.ar") ARGENTINA = ("argentina", "ar")
AUSTRALIA = ("australia", "au", "com.au") AUSTRALIA = ("australia", "au")
AUSTRIA = ("austria", "at", "at") AUSTRIA = ("austria", "at")
BAHRAIN = ("bahrain", "bh") BAHRAIN = ("bahrain", "bh")
BELGIUM = ("belgium", "be", "nl:be") BELGIUM = ("belgium", "be")
BRAZIL = ("brazil", "br", "com.br") BRAZIL = ("brazil", "br")
CANADA = ("canada", "ca", "ca") CANADA = ("canada", "ca")
CHILE = ("chile", "cl") CHILE = ("chile", "cl")
CHINA = ("china", "cn") CHINA = ("china", "cn")
COLOMBIA = ("colombia", "co") COLOMBIA = ("colombia", "co")
@@ -71,24 +72,24 @@ class Country(Enum):
ECUADOR = ("ecuador", "ec") ECUADOR = ("ecuador", "ec")
EGYPT = ("egypt", "eg") EGYPT = ("egypt", "eg")
FINLAND = ("finland", "fi") FINLAND = ("finland", "fi")
FRANCE = ("france", "fr", "fr") FRANCE = ("france", "fr")
GERMANY = ("germany", "de", "de") GERMANY = ("germany", "de")
GREECE = ("greece", "gr") GREECE = ("greece", "gr")
HONGKONG = ("hong kong", "hk", "com.hk") HONGKONG = ("hong kong", "hk")
HUNGARY = ("hungary", "hu") HUNGARY = ("hungary", "hu")
INDIA = ("india", "in", "co.in") INDIA = ("india", "in")
INDONESIA = ("indonesia", "id") INDONESIA = ("indonesia", "id")
IRELAND = ("ireland", "ie", "ie") IRELAND = ("ireland", "ie")
ISRAEL = ("israel", "il") ISRAEL = ("israel", "il")
ITALY = ("italy", "it", "it") ITALY = ("italy", "it")
JAPAN = ("japan", "jp") JAPAN = ("japan", "jp")
KUWAIT = ("kuwait", "kw") KUWAIT = ("kuwait", "kw")
LUXEMBOURG = ("luxembourg", "lu") LUXEMBOURG = ("luxembourg", "lu")
MALAYSIA = ("malaysia", "malaysia") MALAYSIA = ("malaysia", "malaysia")
MEXICO = ("mexico", "mx", "com.mx") MEXICO = ("mexico", "mx")
MOROCCO = ("morocco", "ma") MOROCCO = ("morocco", "ma")
NETHERLANDS = ("netherlands", "nl", "nl") NETHERLANDS = ("netherlands", "nl")
NEWZEALAND = ("new zealand", "nz", "co.nz") NEWZEALAND = ("new zealand", "nz")
NIGERIA = ("nigeria", "ng") NIGERIA = ("nigeria", "ng")
NORWAY = ("norway", "no") NORWAY = ("norway", "no")
OMAN = ("oman", "om") OMAN = ("oman", "om")
@@ -101,19 +102,19 @@ class Country(Enum):
QATAR = ("qatar", "qa") QATAR = ("qatar", "qa")
ROMANIA = ("romania", "ro") ROMANIA = ("romania", "ro")
SAUDIARABIA = ("saudi arabia", "sa") SAUDIARABIA = ("saudi arabia", "sa")
SINGAPORE = ("singapore", "sg", "sg") SINGAPORE = ("singapore", "sg")
SOUTHAFRICA = ("south africa", "za") SOUTHAFRICA = ("south africa", "za")
SOUTHKOREA = ("south korea", "kr") SOUTHKOREA = ("south korea", "kr")
SPAIN = ("spain", "es", "es") SPAIN = ("spain", "es")
SWEDEN = ("sweden", "se") SWEDEN = ("sweden", "se")
SWITZERLAND = ("switzerland", "ch", "de:ch") SWITZERLAND = ("switzerland", "ch")
TAIWAN = ("taiwan", "tw") TAIWAN = ("taiwan", "tw")
THAILAND = ("thailand", "th") THAILAND = ("thailand", "th")
TURKEY = ("turkey", "tr") TURKEY = ("turkey", "tr")
UKRAINE = ("ukraine", "ua") UKRAINE = ("ukraine", "ua")
UNITEDARABEMIRATES = ("united arab emirates", "ae") UNITEDARABEMIRATES = ("united arab emirates", "ae")
UK = ("uk", "uk", "co.uk") UK = ("uk", "uk")
USA = ("usa", "www", "com") USA = ("usa", "www")
URUGUAY = ("uruguay", "uy") URUGUAY = ("uruguay", "uy")
VENEZUELA = ("venezuela", "ve") VENEZUELA = ("venezuela", "ve")
VIETNAM = ("vietnam", "vn") VIETNAM = ("vietnam", "vn")
@@ -124,39 +125,31 @@ class Country(Enum):
# internal for linkeind # internal for linkeind
WORLDWIDE = ("worldwide", "www") WORLDWIDE = ("worldwide", "www")
@property def __new__(cls, country, domain):
def indeed_domain_value(self): obj = object.__new__(cls)
return self.value[1] obj._value_ = country
obj.domain = domain
return obj
@property @property
def glassdoor_domain_value(self): def domain_value(self):
if len(self.value) == 3: return self.domain
subdomain, _, domain = self.value[2].partition(":")
if subdomain and domain:
return f"{subdomain}.glassdoor.{domain}"
else:
return f"www.glassdoor.{self.value[2]}"
else:
raise Exception(f"Glassdoor is not available for {self.name}")
def get_url(self):
return f"https://{self.glassdoor_domain_value}/"
@classmethod @classmethod
def from_string(cls, country_str: str): def from_string(cls, country_str: str):
"""Convert a string to the corresponding Country enum.""" """Convert a string to the corresponding Country enum."""
country_str = country_str.strip().lower() country_str = country_str.strip().lower()
for country in cls: for country in cls:
if country.value[0] == country_str: if country.value == country_str:
return country return country
valid_countries = [country.value for country in cls] valid_countries = [country.value for country in cls]
raise ValueError( raise ValueError(
f"Invalid country string: '{country_str}'. Valid countries are: {', '.join([country[0] for country in valid_countries])}" f"Invalid country string: '{country_str}'. Valid countries (only include this param for Indeed) are: {', '.join(valid_countries)}"
) )
class Location(BaseModel): class Location(BaseModel):
country: Country | None = None country: Country = None
city: Optional[str] = None city: Optional[str] = None
state: Optional[str] = None state: Optional[str] = None
@@ -167,10 +160,10 @@ class Location(BaseModel):
if self.state: if self.state:
location_parts.append(self.state) location_parts.append(self.state)
if self.country and self.country not in (Country.US_CANADA, Country.WORLDWIDE): if self.country and self.country not in (Country.US_CANADA, Country.WORLDWIDE):
if self.country.value[0] in ("usa", "uk"): if self.country.value in ("usa", "uk"):
location_parts.append(self.country.value[0].upper()) location_parts.append(self.country.value.upper())
else: else:
location_parts.append(self.country.value[0].title()) location_parts.append(self.country.value.title())
return ", ".join(location_parts) return ", ".join(location_parts)
@@ -196,8 +189,6 @@ class JobPost(BaseModel):
location: Optional[Location] location: Optional[Location]
description: str | None = None description: str | None = None
company_url: str | None = None
job_type: list[JobType] | None = None job_type: list[JobType] | None = None
compensation: Compensation | None = None compensation: Compensation | None = None
date_posted: date | None = None date_posted: date | None = None
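
The right-hand `Country` definition in this file uses a custom `__new__` so that each member's `.value` is the plain country name, while the Indeed subdomain is kept in a separate `domain` attribute. Below is a minimal, self-contained sketch of that pattern. The member list is trimmed to three entries, and `display_name` is a hypothetical helper that only mirrors the `upper()` / `title()` branch of `display_location()`.

```python
from enum import Enum


class Country(Enum):
    # Each member is declared as (country name, Indeed subdomain).
    USA = ("usa", "www")
    UK = ("uk", "uk")
    NEWZEALAND = ("new zealand", "nz")

    def __new__(cls, country, domain):
        obj = object.__new__(cls)
        obj._value_ = country  # .value becomes the plain name string
        obj.domain = domain    # extra attribute, not part of .value
        return obj

    @classmethod
    def from_string(cls, country_str: str) -> "Country":
        """Look up a member by its name, ignoring case and outer whitespace."""
        key = country_str.strip().lower()
        for country in cls:
            if country.value == key:
                return country
        valid = ", ".join(c.value for c in cls)
        raise ValueError(
            f"Invalid country string: '{key}'. Valid countries are: {valid}"
        )


def display_name(country: Country) -> str:
    # Hypothetical helper: abbreviations are upper-cased, other names title-cased.
    if country.value in ("usa", "uk"):
        return country.value.upper()
    return country.value.title()
```

Because `_value_` is a string rather than a tuple, comparisons such as `country.value == country_str` work without the `[0]` indexing used on the left-hand side of the diff.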

@@ -6,7 +6,6 @@ class Site(Enum):
LINKEDIN = "linkedin" LINKEDIN = "linkedin"
INDEED = "indeed" INDEED = "indeed"
ZIP_RECRUITER = "zip_recruiter" ZIP_RECRUITER = "zip_recruiter"
GLASSDOOR = "glassdoor"
class ScraperInput(BaseModel): class ScraperInput(BaseModel):

@@ -19,8 +19,3 @@ class IndeedException(Exception):
class ZipRecruiterException(Exception): class ZipRecruiterException(Exception):
def __init__(self, message=None): def __init__(self, message=None):
super().__init__(message or "An error occurred with ZipRecruiter") super().__init__(message or "An error occurred with ZipRecruiter")
class GlassdoorException(Exception):
def __init__(self, message=None):
super().__init__(message or "An error occurred with Glassdoor")

@@ -1,286 +0,0 @@
"""
jobspy.scrapers.glassdoor
~~~~~~~~~~~~~~~~~~~
This module contains routines to scrape Glassdoor.
"""
import math
import time
import re
import json
from datetime import datetime, date
from typing import Optional, Tuple, Any
from bs4 import BeautifulSoup
from .. import Scraper, ScraperInput, Site
from ..exceptions import GlassdoorException
from ..utils import count_urgent_words, extract_emails_from_text, create_session
from ...jobs import (
JobPost,
Compensation,
CompensationInterval,
Location,
JobResponse,
JobType,
Country,
)
class GlassdoorScraper(Scraper):
def __init__(self, proxy: Optional[str] = None):
"""
Initializes GlassdoorScraper with the Glassdoor job search url
"""
site = Site(Site.ZIP_RECRUITER)
super().__init__(site, proxy=proxy)
self.url = None
self.country = None
self.jobs_per_page = 30
self.seen_urls = set()
def fetch_jobs_page(
self,
scraper_input: ScraperInput,
location_id: int,
location_type: str,
page_num: int,
cursor: str | None,
) -> (list[JobPost], str | None):
"""
Scrapes a page of Glassdoor for jobs with scraper_input criteria
:param scraper_input:
:return: jobs found on page
:return: cursor for next page
"""
try:
payload = self.add_payload(
scraper_input, location_id, location_type, page_num, cursor
)
session = create_session(self.proxy, is_tls=False)
response = session.post(
f"{self.url}/graph", headers=self.headers(), timeout=10, data=payload
)
if response.status_code != 200:
raise GlassdoorException(
f"bad response status code: {response.status_code}"
)
res_json = response.json()[0]
if "errors" in res_json:
raise ValueError("Error encountered in API response")
except Exception as e:
raise GlassdoorException(str(e))
jobs_data = res_json["data"]["jobListings"]["jobListings"]
jobs = []
for i, job in enumerate(jobs_data):
job_url = res_json["data"]["jobListings"]["jobListingSeoLinks"][
"linkItems"
][i]["url"]
if job_url in self.seen_urls:
continue
self.seen_urls.add(job_url)
job = job["jobview"]
title = job["job"]["jobTitleText"]
company_name = job["header"]["employerNameFromSearch"]
location_name = job["header"].get("locationName", "")
location_type = job["header"].get("locationType", "")
is_remote = False
location = None
if location_type == "S":
is_remote = True
else:
location = self.parse_location(location_name)
compensation = self.parse_compensation(job["header"])
job = JobPost(
title=title,
company_name=company_name,
job_url=job_url,
location=location,
compensation=compensation,
is_remote=is_remote,
)
jobs.append(job)
return jobs, self.get_cursor_for_page(
res_json["data"]["jobListings"]["paginationCursors"], page_num + 1
)
def scrape(self, scraper_input: ScraperInput) -> JobResponse:
"""
Scrapes Glassdoor for jobs with scraper_input criteria.
:param scraper_input: Information about job search criteria.
:return: JobResponse containing a list of jobs.
"""
self.country = scraper_input.country
self.url = self.country.get_url()
location_id, location_type = self.get_location(
scraper_input.location, scraper_input.is_remote
)
all_jobs: list[JobPost] = []
cursor = None
max_pages = 30
try:
for page in range(
1 + (scraper_input.offset // self.jobs_per_page),
min(
(scraper_input.results_wanted // self.jobs_per_page) + 2,
max_pages + 1,
),
):
try:
jobs, cursor = self.fetch_jobs_page(
scraper_input, location_id, location_type, page, cursor
)
all_jobs.extend(jobs)
if len(all_jobs) >= scraper_input.results_wanted:
all_jobs = all_jobs[: scraper_input.results_wanted]
break
except Exception as e:
raise GlassdoorException(str(e))
except Exception as e:
raise GlassdoorException(str(e))
return JobResponse(jobs=all_jobs)
@staticmethod
def parse_compensation(data: dict) -> Optional[Compensation]:
pay_period = data.get("payPeriod")
adjusted_pay = data.get("payPeriodAdjustedPay")
currency = data.get("payCurrency", "USD")
if not pay_period or not adjusted_pay:
return None
interval = None
if pay_period == "ANNUAL":
interval = CompensationInterval.YEARLY
elif pay_period == "MONTHLY":
interval = CompensationInterval.MONTHLY
elif pay_period == "WEEKLY":
interval = CompensationInterval.WEEKLY
elif pay_period == "DAILY":
interval = CompensationInterval.DAILY
elif pay_period == "HOURLY":
interval = CompensationInterval.HOURLY
min_amount = int(adjusted_pay.get("p10") // 1)
max_amount = int(adjusted_pay.get("p90") // 1)
return Compensation(
interval=interval,
min_amount=min_amount,
max_amount=max_amount,
currency=currency,
)
def get_job_type_enum(self, job_type_str: str) -> list[JobType] | None:
for job_type in JobType:
if job_type_str in job_type.value:
return [job_type]
return None
def get_location(self, location: str, is_remote: bool) -> (int, str):
if not location or is_remote:
return "11047", "STATE" # remote options
url = f"{self.url}/findPopularLocationAjax.htm?maxLocationsToReturn=10&term={location}"
session = create_session(self.proxy)
response = session.get(url)
if response.status_code != 200:
raise GlassdoorException(
f"bad response status code: {response.status_code}"
)
items = response.json()
if not items:
raise ValueError(f"Location '{location}' not found on Glassdoor")
location_type = items[0]["locationType"]
if location_type == "C":
location_type = "CITY"
elif location_type == "S":
location_type = "STATE"
return int(items[0]["locationId"]), location_type
@staticmethod
def add_payload(
scraper_input,
location_id: int,
location_type: str,
page_num: int,
cursor: str | None = None,
) -> dict[str, str | Any]:
payload = {
"operationName": "JobSearchResultsQuery",
"variables": {
"excludeJobListingIds": [],
"filterParams": [],
"keyword": scraper_input.search_term,
"numJobsToShow": 30,
"locationType": location_type,
"locationId": int(location_id),
"parameterUrlInput": f"IL.0,12_I{location_type}{location_id}",
"pageNumber": page_num,
"pageCursor": cursor,
},
"query": "query JobSearchResultsQuery($excludeJobListingIds: [Long!], $keyword: String, $locationId: Int, $locationType: LocationTypeEnum, $numJobsToShow: Int!, $pageCursor: String, $pageNumber: Int, $filterParams: [FilterParams], $originalPageUrl: String, $seoFriendlyUrlInput: String, $parameterUrlInput: String, $seoUrl: Boolean) {\n jobListings(\n contextHolder: {searchParams: {excludeJobListingIds: $excludeJobListingIds, keyword: $keyword, locationId: $locationId, locationType: $locationType, numPerPage: $numJobsToShow, pageCursor: $pageCursor, pageNumber: $pageNumber, filterParams: $filterParams, originalPageUrl: $originalPageUrl, seoFriendlyUrlInput: $seoFriendlyUrlInput, parameterUrlInput: $parameterUrlInput, seoUrl: $seoUrl, searchType: SR}}\n ) {\n companyFilterOptions {\n id\n shortName\n __typename\n }\n filterOptions\n indeedCtk\n jobListings {\n ...JobView\n __typename\n }\n jobListingSeoLinks {\n linkItems {\n position\n url\n __typename\n }\n __typename\n }\n jobSearchTrackingKey\n jobsPageSeoData {\n pageMetaDescription\n pageTitle\n __typename\n }\n paginationCursors {\n cursor\n pageNumber\n __typename\n }\n indexablePageForSeo\n searchResultsMetadata {\n searchCriteria {\n implicitLocation {\n id\n localizedDisplayName\n type\n __typename\n }\n keyword\n location {\n id\n shortName\n localizedShortName\n localizedDisplayName\n type\n __typename\n }\n __typename\n }\n footerVO {\n countryMenu {\n childNavigationLinks {\n id\n link\n textKey\n __typename\n }\n __typename\n }\n __typename\n }\n helpCenterDomain\n helpCenterLocale\n jobAlert {\n jobAlertExists\n __typename\n }\n jobSerpFaq {\n questions {\n answer\n question\n __typename\n }\n __typename\n }\n jobSerpJobOutlook {\n occupation\n paragraph\n __typename\n }\n showMachineReadableJobs\n __typename\n }\n serpSeoLinksVO {\n relatedJobTitlesResults\n searchedJobTitle\n searchedKeyword\n searchedLocationIdAsString\n searchedLocationSeoName\n searchedLocationType\n topCityIdsToNameResults {\n 
key\n value\n __typename\n }\n topEmployerIdsToNameResults {\n key\n value\n __typename\n }\n topEmployerNameResults\n topOccupationResults\n __typename\n }\n totalJobsCount\n __typename\n }\n}\n\nfragment JobView on JobListingSearchResult {\n jobview {\n header {\n adOrderId\n advertiserType\n adOrderSponsorshipLevel\n ageInDays\n divisionEmployerName\n easyApply\n employer {\n id\n name\n shortName\n __typename\n }\n employerNameFromSearch\n goc\n gocConfidence\n gocId\n jobCountryId\n jobLink\n jobResultTrackingKey\n jobTitleText\n locationName\n locationType\n locId\n needsCommission\n payCurrency\n payPeriod\n payPeriodAdjustedPay {\n p10\n p50\n p90\n __typename\n }\n rating\n salarySource\n savedJobId\n sponsored\n __typename\n }\n job {\n descriptionFragments\n importConfigId\n jobTitleId\n jobTitleText\n listingId\n __typename\n }\n jobListingAdminDetails {\n cpcVal\n importConfigId\n jobListingId\n jobSourceId\n userEligibleForAdminJobDetails\n __typename\n }\n overview {\n shortName\n squareLogoUrl\n __typename\n }\n __typename\n }\n __typename\n}\n",
}
job_type_filters = {
JobType.FULL_TIME: "fulltime",
JobType.PART_TIME: "parttime",
JobType.CONTRACT: "contract",
JobType.INTERNSHIP: "internship",
JobType.TEMPORARY: "temporary",
}
if scraper_input.job_type in job_type_filters:
filter_value = job_type_filters[scraper_input.job_type]
payload["variables"]["filterParams"].append(
{"filterKey": "jobType", "values": filter_value}
)
return json.dumps([payload])
def parse_location(self, location_name: str) -> Location:
if not location_name or location_name == "Remote":
return None
city, _, state = location_name.partition(", ")
return Location(city=city, state=state)
@staticmethod
def get_cursor_for_page(pagination_cursors, page_num):
for cursor_data in pagination_cursors:
if cursor_data["pageNumber"] == page_num:
return cursor_data["cursor"]
return None
@staticmethod
def headers() -> dict:
"""
Returns headers needed for requests
:return: dict - Dictionary containing headers
"""
return {
"authority": "www.glassdoor.com",
"accept": "*/*",
"accept-language": "en-US,en;q=0.9",
"apollographql-client-name": "job-search-next",
"apollographql-client-version": "4.65.5",
"content-type": "application/json",
"cookie": 'gdId=91e2dfc4-c8b5-4fa7-83d0-11512b80262c; G_ENABLED_IDPS=google; trs=https%3A%2F%2Fwww.redhat.com%2F:referral:referral:2023-07-05+09%3A50%3A14.862:undefined:undefined; g_state={"i_p":1688587331651,"i_l":1}; _cfuvid=.7llazxhYFZWi6EISSPdVjtqF0NMVwzxr_E.cB1jgLs-1697828392979-0-604800000; GSESSIONID=undefined; JSESSIONID=F03DD1B5EE02DB6D842FE42B142F88F3; cass=1; jobsClicked=true; indeedCtk=1hd77b301k79i801; asst=1697829114.2; G_AUTHUSER_H=0; uc=8013A8318C98C517FE6DD0024636DFDEF978FC33266D93A2FAFEF364EACA608949D8B8FA2DC243D62DE271D733EB189D809ABE5B08D7B1AE865D217BD4EEBB97C282F5DA5FEFE79C937E3F6110B2A3A0ADBBA3B4B6DF5A996FEE00516100A65FCB11DA26817BE8D1C1BF6CFE36B5B68A3FDC2CFEC83AB797F7841FBB157C202332FC7E077B56BD39B167BDF3D9866E3B; AWSALB=zxc/Yk1nbWXXT6HjNyn3H4h4950ckVsFV/zOrq5LSoChYLE1qV+hDI8Axi3fUa9rlskndcO0M+Fw+ZnJ+AQ2afBFpyOd1acouLMYgkbEpqpQaWhY6/Gv4QH1zBcJ; AWSALBCORS=zxc/Yk1nbWXXT6HjNyn3H4h4950ckVsFV/zOrq5LSoChYLE1qV+hDI8Axi3fUa9rlskndcO0M+Fw+ZnJ+AQ2afBFpyOd1acouLMYgkbEpqpQaWhY6/Gv4QH1zBcJ; gdsid=1697828393025:1697830776351:668396EDB9E6A832022D34414128093D; at=HkH8Hnqi9uaMC7eu0okqyIwqp07ht9hBvE1_St7E_hRqPvkO9pUeJ1Jcpds4F3g6LL5ADaCNlxrPn0o6DumGMfog8qI1-zxaV_jpiFs3pugntw6WpVyYWdfioIZ1IDKupyteeLQEM1AO4zhGjY_rPZynpsiZBPO_B1au94sKv64rv23yvP56OiWKKfI-8_9hhLACEwWvM-Az7X-4aE2QdFt93VJbXbbGVf07bdDZfimsIkTtgJCLSRhU1V0kEM1Efyu66vo3m77gFFaMW7lxyYnb36I5PdDtEXBm3aL-zR7-qa5ywd94ISEivgqQOA4FPItNhqIlX4XrfD1lxVz6rfPaoTIDi4DI6UMCUjwyPsuv8mn0rYqDfRnmJpZ97fJ5AnhrknAd_6ZWN5v1OrxJczHzcXd8LO820QPoqxzzG13bmSTXLwGSxMUCtSrVsq05hicimQ3jpRt0c1dA4OkTNqF7_770B9JfcHcM8cr8-C4IL56dnOjr9KBGfN1Q2IvZM2cOBRbV7okiNOzKVZ3qJ24AE34WA2F3U6Whiu6H8nIuGG5hSNkVygY6CtglNZfFF9p8pJAZm79PngrrBv-CXFBZmhYLFo46lmFetDkiJ6mirtez4tKpzTIYjIp4_JAkiZFwbLJ2QGH4mK8kyyW0lZiX1DTuQec50N_5wvRo0Gt7nlKxzLsApMnaNhuQeH5ygh_pa381ORo9mQGi0EYF9zk00pa2--z4PtjfQ8KFq36GgpxKy5-o4qgqygZj8F01L8r-FiX2G4C7PREMIpAyHX2A4-_JxA1IS2j12EyqKTLqE9VcP06qm2Z-YuIW3ctmpMxy5G9_KiEiGv17weizhSFnl6SbpAEY-2VSmQ5V6jm3hoMp2jemkuGCRkZeFstLDEPxlzFN7WM; 
__cf_bm=zGaVjIJw4irf40_7UVw54B6Ohm271RUX4Tc8KVScrbs-1697830777-0-AYv2GnKTnnCU+cY9xHbJunO0DwlLDO6SIBnC/s/qldpKsGK0rRAjD6y8lbyATT/KlS7g29OZaN4fbd0lrJg0KmWbIybZIzfWVLHSYePVuOhu; asst=1697829114.2; at=dFhXf64wsf2TlnWy41xLs7skJkuxgKToEGcjGtDfUvW4oEAJ4tTIR5dKQ8wbwT75aIaGgdCfvcb-da7vwrCGWscCncmfLFQpJ9l-LLwoRfk-pMsxHhd77wvf-W7I0HSm7-Q5lQJqI9WyNGRxOa-RpzBTf4L8_Et4-3FzjPaAoYY5pY1FhuwXbN5asGOAMW-p8cjpbfn3PumlIYuckguWnjrcY2F31YJ_1noeoHM9tCGpymANbqGXRkG6aXY7yCfVXtdgZU1K5SMeaSPZIuF_iLUxjc_corzpNiH6qq7BIAmh-e5Aa-g7cwpZcln1fmwTVw4uTMZf1eLIMTa9WzgqZNkvG-sGaq_XxKA_Wai6xTTkOHfRgm4632Ba2963wdJvkGmUUa3tb_L4_wTgk3eFnHp5JhghLfT2Pe3KidP-yX__vx8JOsqe3fndCkKXgVz7xQKe1Dur-sMNlGwi4LXfguTT2YUI8C5Miq3pj2IHc7dC97eyyAiAM4HvyGWfaXWZcei6oIGrOwMvYgy0AcwFry6SIP2SxLT5TrxinRRuem1r1IcOTJsMJyUPp1QsZ7bOyq9G_0060B4CPyovw5523hEuqLTM-R5e5yavY6C_1DHUyE15C3mrh7kdvmlGZeflnHqkFTEKwwOftm-Mv-CKD5Db9ABFGNxKB2FH7nDH67hfOvm4tGNMzceBPKYJ3wciTt9jK3wy39_7cOYVywfrZ-oLhw_XtsbGSSeGn3HytrfgSADAh2sT0Gg6eCC9Xy1vh-Za337SVLUDXZ73W2xJxxUHBkFzZs8L_Xndo5DsbpWhVs9IYUGyraJdqB3SLgDbAppIBCJl4fx6_DG8-xOQPBvuFMlTROe1JVdHOzXI1GElwFDTuH1pjkg4I2G0NhAbE06Y-1illQE; gdsid=1697828393025:1697831731408:99C30D94108AC3030D61C736DDCDF11C',
"gd-csrf-token": "Ft6oHEWlRZrxDww95Cpazw:0pGUrkb2y3TyOpAIqF2vbPmUXoXVkD3oEGDVkvfeCerceQ5-n8mBg3BovySUIjmCPHCaW0H2nQVdqzbtsYqf4Q:wcqRqeegRUa9MVLJGyujVXB7vWFPjdaS1CtrrzJq-ok",
"origin": "https://www.glassdoor.com",
"referer": "https://www.glassdoor.com/",
"sec-ch-ua": '"Chromium";v="118", "Google Chrome";v="118", "Not=A?Brand";v="99"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"macOS"',
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
}

@@ -56,8 +56,9 @@ class IndeedScraper(Scraper):
:return: jobs found on page, total number of jobs found for search :return: jobs found on page, total number of jobs found for search
""" """
self.country = scraper_input.country self.country = scraper_input.country
domain = self.country.indeed_domain_value domain = self.country.domain_value
self.url = f"https://{domain}.indeed.com" self.url = f"https://{domain}.indeed.com"
session = create_session(self.proxy)
params = { params = {
"q": scraper_input.search_term, "q": scraper_input.search_term,
@@ -77,7 +78,6 @@ class IndeedScraper(Scraper):
if sc_values: if sc_values:
params["sc"] = "0kf:" + "".join(sc_values) + ";" params["sc"] = "0kf:" + "".join(sc_values) + ";"
try: try:
session = create_session(self.proxy, is_tls=True)
response = session.get( response = session.get(
f"{self.url}/jobs", f"{self.url}/jobs",
headers=self.get_headers(), headers=self.get_headers(),
@@ -258,8 +258,12 @@ class IndeedScraper(Scraper):
except (KeyError, TypeError, IndexError): except (KeyError, TypeError, IndexError):
return None return None
soup = BeautifulSoup(job_description, "html.parser") soup = BeautifulSoup(
text_content = " ".join(soup.get_text(separator=" ").split()).strip() job_description, "html.parser"
)
text_content = " ".join(
soup.get_text(separator=" ").split()
).strip()
return text_content return text_content
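
The last hunk above only reflows the description cleanup: the HTML is parsed, its text is extracted, and runs of whitespace are collapsed with `split()` and `" ".join(...)`. The sketch below shows the same whitespace-collapsing idea using the standard library's `html.parser` in place of BeautifulSoup. That is a swapped-in technique, so behaviour on malformed markup may differ from the scraper's, and `html_to_text` is a hypothetical name.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    parser = _TextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    # Join the text nodes with spaces, then collapse all whitespace runs.
    return " ".join(" ".join(parser.chunks).split())
```

Joining on a single space after `split()` also removes leading and trailing whitespace, so no further trimming is needed in this version.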

@@ -10,15 +10,20 @@ from datetime import datetime
import requests import requests
import time import time
from requests.exceptions import ProxyError from requests.exceptions import ProxyError
from concurrent.futures import ThreadPoolExecutor, as_completed
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
from bs4.element import Tag from bs4.element import Tag
from threading import Lock from threading import Lock
from urllib.parse import urlparse, urlunparse
from .. import Scraper, ScraperInput, Site from .. import Scraper, ScraperInput, Site
from ..utils import count_urgent_words, extract_emails_from_text, get_enum_from_job_type from ..utils import count_urgent_words, extract_emails_from_text, get_enum_from_job_type
from ..exceptions import LinkedInException from ..exceptions import LinkedInException
from ...jobs import JobPost, Location, JobResponse, JobType, Country from ...jobs import (
JobPost,
Location,
JobResponse,
JobType,
)
class LinkedInScraper(Scraper): class LinkedInScraper(Scraper):
@@ -66,10 +71,12 @@ class LinkedInScraper(Scraper):
if scraper_input.job_type if scraper_input.job_type
else None, else None,
"pageNum": 0, "pageNum": 0,
"start": page + scraper_input.offset, page: page + scraper_input.offset,
"f_AL": "true" if scraper_input.easy_apply else None, "f_AL": "true" if scraper_input.easy_apply else None,
} }
params = {k: v for k, v in params.items() if v is not None}
params = {k: v for k, v in params.items() if v is not None} params = {k: v for k, v in params.items() if v is not None}
retries = 0 retries = 0
while retries < self.MAX_RETRIES: while retries < self.MAX_RETRIES:
@@ -86,7 +93,7 @@ class LinkedInScraper(Scraper):
break break
except requests.HTTPError as e: except requests.HTTPError as e:
if hasattr(e, "response") and e.response is not None: if hasattr(e, "response") and e.response is not None:
if e.response.status_code in (429, 502): if e.response.status_code == 429:
time.sleep(self.DELAY) time.sleep(self.DELAY)
retries += 1 retries += 1
continue continue
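Review note: the two sides of this hunk differ only in which status codes trigger a retry (`(429, 502)` on one side, `429` alone on the other). The retry pattern can be sketched with the standard library as below. `FakeHTTPError`, `get_with_retries`, and `RETRYABLE_STATUSES` are hypothetical names used for illustration and are not part of the scraper.

```python
import time
from typing import Callable

# Assumption: mirrors the tuple used on one side of the hunk above.
RETRYABLE_STATUSES = (429, 502)


class FakeHTTPError(Exception):
    """Hypothetical stand-in for requests.HTTPError carrying a status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def get_with_retries(fetch: Callable[[], str], max_retries: int = 3, delay: float = 0.0) -> str:
    """Call fetch() until it succeeds, sleeping between retryable failures."""
    retries = 0
    while retries < max_retries:
        try:
            return fetch()
        except FakeHTTPError as e:
            if e.status_code in RETRYABLE_STATUSES:
                time.sleep(delay)
                retries += 1
                continue
            raise  # non-retryable status: surface it immediately
    raise RuntimeError("max retries exceeded")
```

Keeping the retryable codes in one tuple means adding or dropping 502 is a one-line change instead of an edit to every `except` block.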
@@ -108,27 +115,32 @@ class LinkedInScraper(Scraper):
soup = BeautifulSoup(response.text, "html.parser") soup = BeautifulSoup(response.text, "html.parser")
for job_card in soup.find_all("div", class_="base-search-card"): with ThreadPoolExecutor(max_workers=5) as executor:
job_url = None futures = []
href_tag = job_card.find("a", class_="base-card__full-link") for job_card in soup.find_all("div", class_="base-search-card"):
if href_tag and "href" in href_tag.attrs: job_url = None
href = href_tag.attrs["href"].split("?")[0] href_tag = job_card.find("a", class_="base-card__full-link")
job_id = href.split("-")[-1] if href_tag and "href" in href_tag.attrs:
job_url = f"{self.url}/jobs/view/{job_id}" href = href_tag.attrs["href"].split("?")[0]
job_id = href.split("-")[-1]
job_url = f"{self.url}/jobs/view/{job_id}"
with url_lock: with url_lock:
if job_url in seen_urls: if job_url in seen_urls:
continue continue
seen_urls.add(job_url) seen_urls.add(job_url)
# Call process_job directly without threading futures.append(executor.submit(self.process_job, job_card, job_url))
try:
job_post = self.process_job(job_card, job_url)
if job_post:
job_list.append(job_post)
except Exception as e:
raise LinkedInException("Exception occurred while processing jobs")
for future in as_completed(futures):
try:
job_post = future.result()
if job_post:
job_list.append(job_post)
except Exception as e:
raise LinkedInException(
"Exception occurred while processing jobs"
)
page += 25 page += 25
job_list = job_list[: scraper_input.results_wanted] job_list = job_list[: scraper_input.results_wanted]
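Review note: the threaded side of this hunk deduplicates URLs and then fans the per-card work out to a `ThreadPoolExecutor`. A minimal, self-contained sketch of that shape follows. `process_unique` and `worker` are hypothetical names; in this sketch the deduplication runs only on the submitting thread, so no lock is needed here, which may not hold for the scraper if `seen_urls` is shared more widely.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_unique(urls: list[str], worker, max_workers: int = 5) -> list:
    """Submit worker(url) once per distinct URL and collect non-empty results."""
    seen_urls: set[str] = set()
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = []
        for url in urls:
            if url in seen_urls:
                continue  # skip duplicates before any work is scheduled
            seen_urls.add(url)
            futures.append(executor.submit(worker, url))
        for future in as_completed(futures):
            result = future.result()  # re-raises any exception from the worker
            if result:
                results.append(result)
    return results
```

Results arrive in completion order, not submission order, so callers that need a stable order should sort afterwards.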
@@ -140,11 +152,6 @@ class LinkedInScraper(Scraper):
company_tag = job_card.find("h4", class_="base-search-card__subtitle") company_tag = job_card.find("h4", class_="base-search-card__subtitle")
company_a_tag = company_tag.find("a") if company_tag else None company_a_tag = company_tag.find("a") if company_tag else None
company_url = (
urlunparse(urlparse(company_a_tag.get("href"))._replace(query=""))
if company_a_tag and company_a_tag.has_attr("href")
else ""
)
company = company_a_tag.get_text(strip=True) if company_a_tag else "N/A" company = company_a_tag.get_text(strip=True) if company_a_tag else "N/A"
metadata_card = job_card.find("div", class_="base-search-card__metadata") metadata_card = job_card.find("div", class_="base-search-card__metadata")
@@ -166,16 +173,15 @@ class LinkedInScraper(Scraper):
benefits = " ".join(benefits_tag.get_text().split()) if benefits_tag else None benefits = " ".join(benefits_tag.get_text().split()) if benefits_tag else None
description, job_type = self.get_job_description(job_url) description, job_type = self.get_job_description(job_url)
# description, job_type = None, []
return JobPost( return JobPost(
title=title, title=title,
description=description, description=description,
company_name=company, company_name=company,
company_url=company_url,
location=location, location=location,
date_posted=date_posted, date_posted=date_posted,
job_url=job_url, job_url=job_url,
# job_type=[JobType.FULL_TIME],
job_type=job_type, job_type=job_type,
benefits=benefits, benefits=benefits,
emails=extract_emails_from_text(description) if description else None, emails=extract_emails_from_text(description) if description else None,
@@ -193,15 +199,8 @@ class LinkedInScraper(Scraper):
try: try:
response = requests.get(job_page_url, timeout=5, proxies=self.proxy) response = requests.get(job_page_url, timeout=5, proxies=self.proxy)
response.raise_for_status() response.raise_for_status()
except requests.HTTPError as e:
if hasattr(e, "response") and e.response is not None:
if e.response.status_code in (429, 502):
time.sleep(self.DELAY)
return None, None
except Exception as e: except Exception as e:
return None, None return None, None
if response.url == "https://www.linkedin.com/signup":
return None, None
soup = BeautifulSoup(response.text, "html.parser") soup = BeautifulSoup(response.text, "html.parser")
div_content = soup.find( div_content = soup.find(
@@ -237,7 +236,7 @@ class LinkedInScraper(Scraper):
employment_type = employment_type.lower() employment_type = employment_type.lower()
employment_type = employment_type.replace("-", "") employment_type = employment_type.replace("-", "")
return [get_enum_from_job_type(employment_type)] if employment_type else [] return [get_enum_from_job_type(employment_type)]
return description, get_job_type(soup) return description, get_job_type(soup)
@@ -247,7 +246,7 @@ class LinkedInScraper(Scraper):
:param metadata_card :param metadata_card
:return: location :return: location
""" """
location = Location(country=Country.from_string(self.country)) location = Location(country=self.country)
if metadata_card is not None: if metadata_card is not None:
location_tag = metadata_card.find( location_tag = metadata_card.find(
"span", class_="job-search-card__location" "span", class_="job-search-card__location"
@@ -259,7 +258,7 @@ class LinkedInScraper(Scraper):
location = Location( location = Location(
city=city, city=city,
state=state, state=state,
country=Country.from_string(self.country), country=self.country,
) )
return location return location


@@ -1,6 +1,4 @@
import re import re
import requests
import tls_client import tls_client
from ..jobs import JobType from ..jobs import JobType
@@ -26,29 +24,23 @@ def extract_emails_from_text(text: str) -> list[str] | None:
return email_regex.findall(text) return email_regex.findall(text)
def create_session(proxy: dict | None = None, is_tls: bool = True): def create_session(proxy: str | None = None):
""" """
Creates a tls client session Creates a tls client session
:return: A session object with or without proxies. :return: A session object with or without proxies.
""" """
if is_tls: session = tls_client.Session(
session = tls_client.Session( client_identifier="chrome112",
client_identifier="chrome112", random_tls_extension_order=True,
random_tls_extension_order=True, )
) session.proxies = proxy
session.proxies = proxy # TODO multiple proxies
# TODO multiple proxies # if self.proxies:
# if self.proxies: # session.proxies = {
# session.proxies = { # "http": random.choice(self.proxies),
# "http": random.choice(self.proxies), # "https": random.choice(self.proxies),
# "https": random.choice(self.proxies), # }
# }
else:
session = requests.Session()
session.allow_redirects = True
if proxy:
session.proxies.update(proxy)
return session return session


@@ -5,24 +5,35 @@ jobspy.scrapers.ziprecruiter
This module contains routines to scrape ZipRecruiter. This module contains routines to scrape ZipRecruiter.
""" """
import math import math
import time import json
import re import re
from datetime import datetime, date from datetime import datetime, date
from typing import Optional, Tuple, Any from typing import Optional, Tuple, Any
from urllib.parse import urlparse, parse_qs, urlunparse
import requests
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor from bs4.element import Tag
from concurrent.futures import ThreadPoolExecutor, Future
from .. import Scraper, ScraperInput, Site from .. import Scraper, ScraperInput, Site
from ..exceptions import ZipRecruiterException from ..exceptions import ZipRecruiterException
from ..utils import count_urgent_words, extract_emails_from_text, create_session from ..utils import count_urgent_words, extract_emails_from_text, create_session
from ...jobs import JobPost, Compensation, Location, JobResponse, JobType, Country from ...jobs import (
JobPost,
Compensation,
CompensationInterval,
Location,
JobResponse,
JobType,
Country,
)
class ZipRecruiterScraper(Scraper): class ZipRecruiterScraper(Scraper):
def __init__(self, proxy: Optional[str] = None): def __init__(self, proxy: Optional[str] = None):
""" """
Initializes ZipRecruiterScraper with the ZipRecruiter job search url Initializes LinkedInScraper with the ZipRecruiter job search url
""" """
site = Site(Site.ZIP_RECRUITER) site = Site(Site.ZIP_RECRUITER)
self.url = "https://www.ziprecruiter.com" self.url = "https://www.ziprecruiter.com"
@@ -31,24 +42,21 @@ class ZipRecruiterScraper(Scraper):
self.jobs_per_page = 20 self.jobs_per_page = 20
self.seen_urls = set() self.seen_urls = set()
def find_jobs_in_page( def find_jobs_in_page(self, scraper_input: ScraperInput, continue_token: Optional[str] = None) -> Tuple[list[JobPost], Optional[str]]:
self, scraper_input: ScraperInput, continue_token: str | None = None
) -> Tuple[list[JobPost], Optional[str]]:
""" """
Scrapes a page of ZipRecruiter for jobs with scraper_input criteria Scrapes a page of ZipRecruiter for jobs with scraper_input criteria
:param scraper_input: :param scraper_input:
:param continue_token:
:return: jobs found on page :return: jobs found on page
""" """
params = self.add_params(scraper_input) params = self.add_params(scraper_input)
if continue_token: if continue_token:
params["continue"] = continue_token params['continue'] = continue_token
try: try:
session = create_session(self.proxy, is_tls=False) response = requests.get(
response = session.get(
f"https://api.ziprecruiter.com/jobs-app/jobs", f"https://api.ziprecruiter.com/jobs-app/jobs",
headers=self.headers(), headers=self.headers(),
params=self.add_params(scraper_input), params=self.add_params(scraper_input),
allow_redirects=True,
timeout=10, timeout=10,
) )
if response.status_code != 200: if response.status_code != 200:
@@ -60,13 +68,15 @@ class ZipRecruiterScraper(Scraper):
raise ZipRecruiterException("bad proxy") raise ZipRecruiterException("bad proxy")
raise ZipRecruiterException(str(e)) raise ZipRecruiterException(str(e))
time.sleep(5)
response_data = response.json() response_data = response.json()
jobs_list = response_data.get("jobs", []) jobs_list = response_data.get("jobs", [])
next_continue_token = response_data.get("continue", None) next_continue_token = response_data.get('continue', None)
with ThreadPoolExecutor(max_workers=self.jobs_per_page) as executor: with ThreadPoolExecutor(max_workers=10) as executor:
job_results = [executor.submit(self.process_job, job) for job in jobs_list] job_results = [
executor.submit(self.process_job, job)
for job in jobs_list
]
job_list = [result.result() for result in job_results if result.result()] job_list = [result.result() for result in job_results if result.result()]
return job_list, next_continue_token return job_list, next_continue_token
@@ -86,9 +96,7 @@ class ZipRecruiterScraper(Scraper):
if len(job_list) >= scraper_input.results_wanted: if len(job_list) >= scraper_input.results_wanted:
break break
jobs_on_page, continue_token = self.find_jobs_in_page( jobs_on_page, continue_token = self.find_jobs_in_page(scraper_input, continue_token)
scraper_input, continue_token
)
if jobs_on_page: if jobs_on_page:
job_list.extend(jobs_on_page) job_list.extend(jobs_on_page)
@@ -96,13 +104,12 @@ class ZipRecruiterScraper(Scraper):
break break
if len(job_list) > scraper_input.results_wanted: if len(job_list) > scraper_input.results_wanted:
job_list = job_list[: scraper_input.results_wanted] job_list = job_list[:scraper_input.results_wanted]
return JobResponse(jobs=job_list) return JobResponse(jobs=job_list)
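Review note: the loop above pages through results with a continue token. One thing worth checking in `find_jobs_in_page`: the token is written into `params`, but the request is sent with `params=self.add_params(scraper_input)`, so as written the token does not appear to reach the API. The sketch below shows the intended pagination shape under that reading. `collect_pages` and `fetch_page` are hypothetical names, and `max_pages` is an assumed safety bound.

```python
from typing import Callable, Optional, Tuple


def collect_pages(
    fetch_page: Callable[[Optional[str]], Tuple[list, Optional[str]]],
    results_wanted: int,
    max_pages: int = 10,
) -> list:
    """Follow continue tokens until enough items are collected or pages run out."""
    items: list = []
    continue_token: Optional[str] = None
    for _ in range(max_pages):
        if len(items) >= results_wanted:
            break
        # The token from the previous page is passed into the next request.
        page_items, continue_token = fetch_page(continue_token)
        if page_items:
            items.extend(page_items)
        if not continue_token:
            break  # the server signalled there are no further pages
    return items[:results_wanted]
```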
@staticmethod def process_job(self, job: dict) -> JobPost:
def process_job(job: dict) -> JobPost: """the most common type of jobs page on ZR"""
"""Processes an individual job dict from the response"""
title = job.get("name") title = job.get("name")
job_url = job.get("job_url") job_url = job.get("job_url")
@@ -110,12 +117,9 @@ class ZipRecruiterScraper(Scraper):
job.get("job_description", "").strip(), "html.parser" job.get("job_description", "").strip(), "html.parser"
).get_text() ).get_text()
company = job["hiring_company"].get("name") if "hiring_company" in job else None company = job['hiring_company'].get("name") if "hiring_company" in job else None
country_value = "usa" if job.get("job_country") == "US" else "canada"
country_enum = Country.from_string(country_value)
location = Location( location = Location(
city=job.get("job_city"), state=job.get("job_state"), country=country_enum city=job.get("job_city"), state=job.get("job_state"), country='usa' if job.get("job_country") == 'US' else 'canada'
) )
job_type = ZipRecruiterScraper.get_job_type_enum( job_type = ZipRecruiterScraper.get_job_type_enum(
job.get("employment_type", "").replace("_", "").lower() job.get("employment_type", "").replace("_", "").lower()
@@ -132,21 +136,16 @@ class ZipRecruiterScraper(Scraper):
else: else:
date_posted = date.today() date_posted = date.today()
return JobPost( return JobPost(
title=title, title=title,
company_name=company, company_name=company,
location=location, location=location,
job_type=job_type, job_type=job_type,
compensation=Compensation( compensation=Compensation(
interval="yearly" interval="yearly" if job.get("compensation_interval") == "annual" else job.get("compensation_interval") ,
if job.get("compensation_interval") == "annual" min_amount=int(job["compensation_min"]) if "compensation_min" in job else None,
else job.get("compensation_interval"), max_amount=int(job["compensation_max"]) if "compensation_max" in job else None,
min_amount=int(job["compensation_min"])
if "compensation_min" in job
else None,
max_amount=int(job["compensation_max"])
if "compensation_max" in job
else None,
currency=job.get("compensation_currency"), currency=job.get("compensation_currency"),
), ),
date_posted=date_posted, date_posted=date_posted,
@@ -192,6 +191,107 @@ class ZipRecruiterScraper(Scraper):
return params return params
@staticmethod
def get_interval(interval_str: str):
"""
Maps the interval alias to its appropriate CompensationInterval.
:param interval_str
:return: CompensationInterval
"""
interval_alias = {"annually": CompensationInterval.YEARLY}
interval_str = interval_str.lower()
if interval_str in interval_alias:
return interval_alias[interval_str]
return CompensationInterval(interval_str)
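Review note: `get_interval` maps the alias `"annually"`, while `process_job` elsewhere in this diff checks for `"annual"`. A single alias table covering both spellings could look like the sketch below. `PayInterval` is a hypothetical stand-in for the project's `CompensationInterval`, whose actual members are not shown in this diff.

```python
from enum import Enum


class PayInterval(Enum):
    """Hypothetical stand-in for the project's CompensationInterval enum."""

    YEARLY = "yearly"
    MONTHLY = "monthly"
    WEEKLY = "weekly"
    DAILY = "daily"
    HOURLY = "hourly"


# Assumption: both spellings seen in this diff should map to the same member.
INTERVAL_ALIASES = {
    "annually": PayInterval.YEARLY,
    "annual": PayInterval.YEARLY,
}


def get_interval(interval_str: str) -> PayInterval:
    """Map an interval word or alias to its enum member."""
    key = interval_str.strip().lower()
    if key in INTERVAL_ALIASES:
        return INTERVAL_ALIASES[key]
    return PayInterval(key)  # raises ValueError for unknown intervals
```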
@staticmethod
def get_date_posted(job: Tag) -> Optional[datetime.date]:
"""
Extracts the date a job was posted
:param job
:return: date the job was posted or None
"""
button = job.find(
"button", {"class": "action_input save_job zrs_btn_secondary_200"}
)
if not button:
return None
url_time = button.get("data-href", "")
url_components = urlparse(url_time)
params = parse_qs(url_components.query)
posted_time_str = params.get("posted_time", [None])[0]
if posted_time_str:
posted_date = datetime.strptime(
posted_time_str, "%Y-%m-%dT%H:%M:%SZ"
).date()
return posted_date
return None
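Review note: `get_date_posted` pulls `posted_time` out of a URL's query string. The parsing step on its own can be sketched with the standard library as below. `posted_date_from_url` is a hypothetical helper name. Also note that with `from datetime import datetime, date`, the annotation `Optional[datetime.date]` refers to the `datetime.date` method, not the `date` type; the sketch annotates with `date` directly.

```python
from datetime import date, datetime
from typing import Optional
from urllib.parse import parse_qs, urlparse


def posted_date_from_url(url: str) -> Optional[date]:
    """Return the date encoded in the posted_time query parameter, if present."""
    query = parse_qs(urlparse(url).query)  # percent-decoding happens here
    posted_time_str = query.get("posted_time", [None])[0]
    if not posted_time_str:
        return None
    return datetime.strptime(posted_time_str, "%Y-%m-%dT%H:%M:%SZ").date()
```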
@staticmethod
def get_compensation(job: Tag) -> Optional[Compensation]:
"""
Parses the compensation tag from the job BeautifulSoup object
:param job
:return: Compensation object or None
"""
pay_element = job.find("li", {"class": "perk_item perk_pay"})
if pay_element is None:
return None
pay = pay_element.find("div", {"class": "value"}).find("span").text.strip()
def create_compensation_object(pay_string: str) -> Compensation:
"""
Creates a Compensation object from a pay_string
:param pay_string
:return: compensation
"""
interval = ZipRecruiterScraper.get_interval(pay_string.split()[-1])
amounts = []
for amount in pay_string.split("to"):
amount = amount.replace(",", "").strip("$ ").split(" ")[0]
if "K" in amount:
amount = amount.replace("K", "")
amount = int(float(amount)) * 1000
else:
amount = int(float(amount))
amounts.append(amount)
compensation = Compensation(
interval=interval,
min_amount=min(amounts),
max_amount=max(amounts),
currency="USD/CAD",
)
return compensation
return create_compensation_object(pay)
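Review note: in `create_compensation_object`, `int(float(amount)) * 1000` truncates before scaling, so a value such as `52.5K` would become 52000. The sketch below scales first and then rounds. It also splits on `" to "` with surrounding spaces, a deliberate difference from the bare `"to"` split above. `parse_amount` and `parse_pay_range` are hypothetical names, and the pay-string format is assumed from the code in this hunk.

```python
def parse_amount(token: str) -> int:
    """Convert a token such as '$52.5K' or '$1,500' to a whole-unit amount."""
    token = token.replace(",", "").strip("$ ")
    if token.upper().endswith("K"):
        # Scale before rounding so fractional thousands are preserved.
        return round(float(token[:-1]) * 1000)
    return int(float(token))


def parse_pay_range(pay_string: str) -> tuple[int, int, str]:
    """Return (min_amount, max_amount, interval_word) for a pay string."""
    interval = pay_string.split()[-1].lower()
    amounts = []
    for part in pay_string.split(" to "):
        token = part.strip().split(" ")[0]  # drop a trailing interval word
        amounts.append(parse_amount(token))
    return min(amounts), max(amounts), interval
```

A single-amount string such as `"$25 Hourly"` yields the same value for both bounds.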
@staticmethod
def get_location(job: Tag) -> Location:
"""
Extracts the job location from BeautifulSoup object
:param job:
:return: location
"""
location_link = job.find("a", {"class": "company_location"})
if location_link is not None:
location_string = location_link.text.strip()
parts = location_string.split(", ")
if len(parts) == 2:
city, state = parts
else:
city, state = None, None
else:
city, state = None, None
return Location(city=city, state=state, country=Country.US_CANADA)
@staticmethod @staticmethod
def headers() -> dict: def headers() -> dict:
""" """
@@ -199,13 +299,13 @@ class ZipRecruiterScraper(Scraper):
:return: dict - Dictionary containing headers :return: dict - Dictionary containing headers
""" """
return { return {
"Host": "api.ziprecruiter.com", 'Host': 'api.ziprecruiter.com',
"Cookie": "ziprecruiter_browser=018188e0-045b-4ad7-aa50-627a6c3d43aa; ziprecruiter_session=5259b2219bf95b6d2299a1417424bc2edc9f4b38; SplitSV=2016-10-19%3AU2FsdGVkX19f9%2Bx70knxc%2FeR3xXR8lWoTcYfq5QjmLU%3D%0A; __cf_bm=qXim3DtLPbOL83GIp.ddQEOFVFTc1OBGPckiHYxcz3o-1698521532-0-AfUOCkgCZyVbiW1ziUwyefCfzNrJJTTKPYnif1FZGQkT60dMowmSU/Y/lP+WiygkFPW/KbYJmyc+MQSkkad5YygYaARflaRj51abnD+SyF9V; zglobalid=68d49bd5-0326-428e-aba8-8a04b64bc67c.af2d99ff7c03.653d61bb; ziprecruiter_browser=018188e0-045b-4ad7-aa50-627a6c3d43aa; ziprecruiter_session=5259b2219bf95b6d2299a1417424bc2edc9f4b38", 'Cookie': 'ziprecruiter_browser=018188e0-045b-4ad7-aa50-627a6c3d43aa; ziprecruiter_session=5259b2219bf95b6d2299a1417424bc2edc9f4b38; SplitSV=2016-10-19%3AU2FsdGVkX19f9%2Bx70knxc%2FeR3xXR8lWoTcYfq5QjmLU%3D%0A; __cf_bm=qXim3DtLPbOL83GIp.ddQEOFVFTc1OBGPckiHYxcz3o-1698521532-0-AfUOCkgCZyVbiW1ziUwyefCfzNrJJTTKPYnif1FZGQkT60dMowmSU/Y/lP+WiygkFPW/KbYJmyc+MQSkkad5YygYaARflaRj51abnD+SyF9V; zglobalid=68d49bd5-0326-428e-aba8-8a04b64bc67c.af2d99ff7c03.653d61bb; ziprecruiter_browser=018188e0-045b-4ad7-aa50-627a6c3d43aa; ziprecruiter_session=5259b2219bf95b6d2299a1417424bc2edc9f4b38',
"accept": "*/*", 'accept': '*/*',
"x-zr-zva-override": "100000000;vid:ZT1huzm_EQlDTVEc", 'x-zr-zva-override': '100000000;vid:ZT1huzm_EQlDTVEc',
"x-pushnotificationid": "0ff4983d38d7fc5b3370297f2bcffcf4b3321c418f5c22dd152a0264707602a0", 'x-pushnotificationid': '0ff4983d38d7fc5b3370297f2bcffcf4b3321c418f5c22dd152a0264707602a0',
"x-deviceid": "D77B3A92-E589-46A4-8A39-6EF6F1D86006", 'x-deviceid': 'D77B3A92-E589-46A4-8A39-6EF6F1D86006',
"user-agent": "Job Search/87.0 (iPhone; CPU iOS 16_6_1 like Mac OS X)", 'user-agent': 'Job Search/87.0 (iPhone; CPU iOS 16_6_1 like Mac OS X)',
"authorization": "Basic YTBlZjMyZDYtN2I0Yy00MWVkLWEyODMtYTI1NDAzMzI0YTcyOg==", 'authorization': 'Basic YTBlZjMyZDYtN2I0Yy00MWVkLWEyODMtYTI1NDAzMzI0YTcyOg==',
"accept-language": "en-US,en;q=0.9", 'accept-language': 'en-US,en;q=0.9'
} }


@@ -4,7 +4,7 @@ import pandas as pd
def test_all(): def test_all():
result = scrape_jobs( result = scrape_jobs(
site_name=["linkedin", "indeed", "zip_recruiter", "glassdoor"], site_name=["linkedin", "indeed", "zip_recruiter"],
search_term="software engineer", search_term="software engineer",
results_wanted=5, results_wanted=5,
) )


@@ -1,11 +0,0 @@
from ..jobspy import scrape_jobs
import pandas as pd
def test_indeed():
result = scrape_jobs(
site_name="glassdoor", search_term="software engineer", country_indeed="USA"
)
assert (
isinstance(result, pd.DataFrame) and not result.empty
), "Result should be a non-empty DataFrame"


@@ -4,7 +4,8 @@ import pandas as pd
def test_indeed(): def test_indeed():
result = scrape_jobs( result = scrape_jobs(
site_name="indeed", search_term="software engineer", country_indeed="usa" site_name="indeed",
search_term="software engineer",
) )
assert ( assert (
isinstance(result, pd.DataFrame) and not result.empty isinstance(result, pd.DataFrame) and not result.empty