Compare commits

..

3 Commits

Author         SHA1        Message            Date
Cullen Watson  9c43f82fb1  pass test          2024-10-19 18:01:02 -05:00
Cullen Watson  6ba571f5e4  pass test          2024-10-19 17:58:26 -05:00
Cullen Watson  b43289fa38  indeed:remove tpe  2024-10-19 17:55:36 -05:00
25 changed files with 1151 additions and 1684 deletions


@@ -1,50 +1,33 @@
 name: Publish Python 🐍 distributions 📦 to PyPI
-on:
-  pull_request:
-    types:
-      - closed
-permissions:
-  contents: write
+on: push
 jobs:
   build-n-publish:
     name: Build and publish Python 🐍 distributions 📦 to PyPI
     runs-on: ubuntu-latest
-    if: github.event.pull_request.merged == true && github.event.pull_request.base.ref == 'main'
     steps:
       - uses: actions/checkout@v3
       - name: Set up Python
         uses: actions/setup-python@v4
         with:
           python-version: "3.10"
-      - name: Install dependencies
-        run: pip install toml
-      - name: Increment version
-        run: python increment_version.py
-      - name: Commit version increment
-        run: |
-          git config --global user.name 'github-actions'
-          git config --global user.email 'github-actions@github.com'
-          git add pyproject.toml
-          git commit -m 'Increment version'
-      - name: Push changes
-        run: git push
       - name: Install poetry
-        run: pip install poetry --user
+        run: >-
+          python3 -m
+          pip install
+          poetry
+          --user
       - name: Build distribution 📦
-        run: poetry build
+        run: >-
+          python3 -m
+          poetry
+          build
       - name: Publish distribution 📦 to PyPI
+        if: startsWith(github.ref, 'refs/tags')
         uses: pypa/gh-action-pypi-publish@release/v1
         with:
           password: ${{ secrets.PYPI_API_TOKEN }}

22 .github/workflows/python-test.yml vendored Normal file

@@ -0,0 +1,22 @@
name: Python Tests
on:
  pull_request:
    branches:
      - main
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - name: Install dependencies
        run: |
          pip install poetry
          poetry install
      - name: Run tests
        run: poetry run pytest tests/test_all.py

144 README.md

@@ -1,12 +1,17 @@
 <img src="https://github.com/cullenwatson/JobSpy/assets/78247585/ae185b7e-e444-4712-8bb9-fa97f53e896b" width="400">
-**JobSpy** is a job scraping library with the goal of aggregating all the jobs from popular job boards with one tool.
+**JobSpy** is a simple, yet comprehensive, job scraping library.
+
+**Not technical?** Try out the web scraping tool on our site at [usejobspy.com](https://usejobspy.com).
+
+*Looking to build a data-focused software product?* **[Book a call](https://bunsly.com/)** *to
+work with us.*
 ## Features
-- Scrapes job postings from **LinkedIn**, **Indeed**, **Glassdoor**, **Google**, **ZipRecruiter**, & **Bayt** concurrently
-- Aggregates the job postings in a dataframe
-- Proxies support to bypass blocking
+- Scrapes job postings from **LinkedIn**, **Indeed**, **Glassdoor**, & **ZipRecruiter** simultaneously
+- Aggregates the job postings in a Pandas DataFrame
+- Proxies support
 ![jobspy](https://github.com/cullenwatson/JobSpy/assets/78247585/ec7ef355-05f6-4fd3-8161-a817e31c5c57)
@@ -25,16 +30,16 @@ import csv
 from jobspy import scrape_jobs

 jobs = scrape_jobs(
-    site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor", "google", "bayt"],
+    site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"],
     search_term="software engineer",
-    google_search_term="software engineer jobs near San Francisco, CA since yesterday",
-    location="San Francisco, CA",
+    location="Dallas, TX",
     results_wanted=20,
-    hours_old=72,
-    country_indeed='USA',
-    # linkedin_fetch_description=True # gets more info such as description, direct job url (slower)
+    hours_old=72, # (only Linkedin/Indeed is hour specific, others round up to days old)
+    country_indeed='USA',  # only needed for indeed / glassdoor
+    # linkedin_fetch_description=True # get more info such as full description, direct job url for linkedin (slower)
     # proxies=["208.195.175.46:65095", "208.195.175.45:65095", "localhost"],
 )
 print(f"Found {len(jobs)} jobs")
 print(jobs.head())
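The hunk header above carries an `import csv` context line, which suggests the quick-start continues with a CSV export. A minimal sketch continuing from the snippet above (the exact `to_csv` kwargs are an assumption, not taken from this diff):

```py
import csv

# Quote non-numeric fields so multiline job descriptions survive the round trip.
jobs.to_csv("jobs.csv", quoting=csv.QUOTE_NONNUMERIC, escapechar="\\", index=False)
```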
@@ -58,13 +63,10 @@ zip_recruiter Software Developer TEKsystems Phoenix
 ```plaintext
 Optional
 ├── site_name (list|str):
-|    linkedin, zip_recruiter, indeed, glassdoor, google, bayt
-|    (default is all)
+|    linkedin, zip_recruiter, indeed, glassdoor
+|    (default is all four)
 ├── search_term (str)
-|
-├── google_search_term (str)
-|    search term for google jobs. This is the only param for filtering google jobs.
 ├── location (str)
@@ -78,13 +80,16 @@ Optional
 |    in format ['user:pass@host:port', 'localhost']
 |    each job board scraper will round robin through the proxies
 |
+├── ca_cert (str)
+|    path to CA Certificate file for proxies
 ├── is_remote (bool)
 ├── results_wanted (int):
 |    number of job results to retrieve for each site specified in 'site_name'
 ├── easy_apply (bool):
-|    filters for jobs that are hosted on the job board site (LinkedIn easy apply filter no longer works)
+|    filters for jobs that are hosted on the job board site
 ├── description_format (str):
 |    markdown, html (Format type of the job descriptions. Default is markdown.)
@@ -111,9 +116,6 @@ Optional
 |
 ├── enforce_annual_salary (bool):
 |    converts wages to annual salary
-|
-├── ca_cert (str)
-|    path to CA Certificate file for proxies
 ```
 ```
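For context on the `proxies` and `ca_cert` entries documented above, a hedged sketch of a call that uses them (hosts, ports, and the cert path are placeholders):

```py
from jobspy import scrape_jobs

# Each board's scraper rotates through this list round-robin;
# "localhost" means "no proxy" for that turn.
jobs = scrape_jobs(
    site_name="indeed",
    search_term="software engineer",
    proxies=["user:pass@203.0.113.10:8080", "localhost"],
    ca_cert="/path/to/ca-bundle.pem",  # placeholder path
)
```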
@@ -129,6 +131,46 @@ Optional
 |    - easy_apply
 ```
+
+### JobPost Schema
+```plaintext
+JobPost
+├── title
+├── company
+├── company_url
+├── job_url
+├── location
+│   ├── country
+│   ├── city
+│   ├── state
+├── description
+├── job_type: fulltime, parttime, internship, contract
+├── job_function
+│   ├── interval: yearly, monthly, weekly, daily, hourly
+│   ├── min_amount
+│   ├── max_amount
+│   ├── currency
+│   └── salary_source: direct_data, description (parsed from posting)
+├── date_posted
+├── emails
+└── is_remote
+
+Linkedin specific
+└── job_level
+
+Linkedin & Indeed specific
+└── company_industry
+
+Indeed specific
+├── company_country
+├── company_addresses
+├── company_employees_label
+├── company_revenue_label
+├── company_description
+└── logo_photo_url
+```
 ## Supported Countries for Job Searching

 ### **LinkedIn**
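Since `scrape_jobs` flattens these JobPost fields into DataFrame columns, here is a small sketch of filtering on the salary fields, continuing from the quick-start above (column names follow the schema; values may be NaN when a posting has no compensation data):

```py
# Keep yearly-salaried rows advertising at least $100k.
salaried = jobs[(jobs["interval"] == "yearly") & (jobs["min_amount"] >= 100_000)]
print(salaried[["title", "company", "min_amount", "max_amount"]].head())
```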
@@ -165,11 +207,6 @@ You can specify the following countries when searching on Indeed (use the exact
 | United Arab Emirates | UK* | USA* | Uruguay |
 | Venezuela | Vietnam* | | |
-### **Bayt**
-
-Bayt only uses the search_term parameter currently and searches internationally
 ## Notes
 * Indeed is the best scraper currently with no rate limiting.
@@ -180,23 +217,7 @@ Bayt only uses the search_term parameter currently and searches internationally
 ---
 **Q: Why is Indeed giving unrelated roles?**
-**A:** Indeed searches the description too.
-- use - to remove words
-- "" for exact match
-
-Example of a good Indeed query
-```py
-search_term='"engineering intern" software summer (java OR python OR c++) 2025 -tax -marketing'
-```
-This searches the description/title and must include software, summer, 2025, one of the languages, engineering intern exactly, no tax, no marketing.
-
----
-
-**Q: No results when using "google"?**
-**A:** You have to use super specific syntax. Search for google jobs on your browser and then whatever pops up in the google jobs search box after applying some filters is what you need to copy & paste into the google_search_term.
+**A:** Indeed is searching each one of your terms e.g. software intern, it searches software OR intern. Try search_term='"software intern"' in quotes for stricter searching
 ---
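For anyone keeping the left-hand google scraper, the removed FAQ pairs with the `google_search_term` parameter; a minimal sketch reusing the query string from the left side of the quick-start above:

```py
from jobspy import scrape_jobs

# The google scraper ignores the other filters; the whole query lives in
# google_search_term, copied verbatim from the Google Jobs search box.
jobs = scrape_jobs(
    site_name="google",
    google_search_term="software engineer jobs near San Francisco, CA since yesterday",
    results_wanted=20,
)
```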
@@ -208,41 +229,8 @@ This searches the description/title and must include software, summer, 2025, one
 ---
-### JobPost Schema
-
-```plaintext
-JobPost
-├── title
-├── company
-├── company_url
-├── job_url
-├── location
-│   ├── country
-│   ├── city
-│   ├── state
-├── description
-├── job_type: fulltime, parttime, internship, contract
-├── job_function
-│   ├── interval: yearly, monthly, weekly, daily, hourly
-│   ├── min_amount
-│   ├── max_amount
-│   ├── currency
-│   └── salary_source: direct_data, description (parsed from posting)
-├── date_posted
-├── emails
-└── is_remote
-
-Linkedin specific
-└── job_level
-
-Linkedin & Indeed specific
-└── company_industry
-
-Indeed specific
-├── company_country
-├── company_addresses
-├── company_employees_label
-├── company_revenue_label
-├── company_description
-└── company_logo
-```
+**Q: Encountering issues with your queries?**
+**A:** Try reducing the number of `results_wanted` and/or broadening the filters. If problems
+persist, [submit an issue](https://github.com/Bunsly/JobSpy/issues).
+
+---


@@ -1,21 +0,0 @@
import toml

def increment_version(version):
    major, minor, patch = map(int, version.split('.'))
    patch += 1
    return f"{major}.{minor}.{patch}"

# Load pyproject.toml
with open('pyproject.toml', 'r') as file:
    pyproject = toml.load(file)

# Increment the version
current_version = pyproject['tool']['poetry']['version']
new_version = increment_version(current_version)
pyproject['tool']['poetry']['version'] = new_version

# Save the updated pyproject.toml
with open('pyproject.toml', 'w') as file:
    toml.dump(pyproject, file)

print(f"Version updated from {current_version} to {new_version}")

1901 poetry.lock generated

File diff suppressed because it is too large

2 poetry.toml Normal file

@@ -0,0 +1,2 @@
[virtualenvs]
in-project = true


@@ -1,21 +1,15 @@
-[build-system]
-requires = [ "poetry-core",]
-build-backend = "poetry.core.masonry.api"
 [tool.poetry]
 name = "python-jobspy"
-version = "1.1.76"
+version = "1.1.70"
 description = "Job scraper for LinkedIn, Indeed, Glassdoor & ZipRecruiter"
-authors = [ "Zachary Hampton <zachary@bunsly.com>", "Cullen Watson <cullen@bunsly.com>",]
+authors = ["Zachary Hampton <zachary@bunsly.com>", "Cullen Watson <cullen@bunsly.com>"]
 homepage = "https://github.com/Bunsly/JobSpy"
 readme = "README.md"
-keywords = [ "jobs-scraper", "linkedin", "indeed", "glassdoor", "ziprecruiter",]
+keywords = ['jobs-scraper', 'linkedin', 'indeed', 'glassdoor', 'ziprecruiter']

-[[tool.poetry.packages]]
-include = "jobspy"
-from = "src"
-
-[tool.black]
-line-length = 88
+packages = [
+    { include = "jobspy", from = "src" }
+]

 [tool.poetry.dependencies]
 python = "^3.10"
@@ -25,11 +19,19 @@ pandas = "^2.1.0"
 NUMPY = "1.26.3"
 pydantic = "^2.3.0"
 tls-client = "^1.0.1"
-markdownify = "^0.13.1"
+markdownify = "^0.11.6"
 regex = "^2024.4.28"

 [tool.poetry.group.dev.dependencies]
 pytest = "^7.4.1"
 jupyter = "^1.0.0"
 black = "*"
 pre-commit = "*"

+[build-system]
+requires = ["poetry-core"]
+build-backend = "poetry.core.masonry.api"
+
+[tool.black]
+line-length = 88


@@ -9,23 +9,19 @@ from .scrapers.utils import set_logger_level, extract_salary, create_logger
 from .scrapers.indeed import IndeedScraper
 from .scrapers.ziprecruiter import ZipRecruiterScraper
 from .scrapers.glassdoor import GlassdoorScraper
-from .scrapers.google import GoogleJobsScraper
 from .scrapers.linkedin import LinkedInScraper
-from .scrapers.bayt import BaytScraper
 from .scrapers import SalarySource, ScraperInput, Site, JobResponse, Country
 from .scrapers.exceptions import (
     LinkedInException,
     IndeedException,
     ZipRecruiterException,
     GlassdoorException,
-    GoogleJobsException,
 )


 def scrape_jobs(
     site_name: str | list[str] | Site | list[Site] | None = None,
     search_term: str | None = None,
-    google_search_term: str | None = None,
     location: str | None = None,
     distance: int | None = 50,
     is_remote: bool = False,
@@ -42,7 +38,7 @@ def scrape_jobs(
     offset: int | None = 0,
     hours_old: int = None,
     enforce_annual_salary: bool = False,
-    verbose: int = 0,
+    verbose: int = 2,
     **kwargs,
 ) -> pd.DataFrame:
     """
@@ -54,8 +50,6 @@ def scrape_jobs(
         Site.INDEED: IndeedScraper,
         Site.ZIP_RECRUITER: ZipRecruiterScraper,
         Site.GLASSDOOR: GlassdoorScraper,
-        Site.GOOGLE: GoogleJobsScraper,
-        Site.BAYT: BaytScraper,
     }
     set_logger_level(verbose)
@@ -89,7 +83,6 @@ def scrape_jobs(
         site_type=get_site_type(),
         country=country_enum,
         search_term=search_term,
-        google_search_term=google_search_term,
         location=location,
         distance=distance,
         is_remote=is_remote,
@@ -220,8 +213,8 @@ def scrape_jobs(
"title", "title",
"company", "company",
"location", "location",
"date_posted",
"job_type", "job_type",
"date_posted",
"salary_source", "salary_source",
"interval", "interval",
"min_amount", "min_amount",
@@ -230,12 +223,12 @@ def scrape_jobs(
"is_remote", "is_remote",
"job_level", "job_level",
"job_function", "job_function",
"company_industry",
"listing_type", "listing_type",
"emails", "emails",
"description", "description",
"company_industry",
"company_url", "company_url",
"company_logo", "logo_photo_url",
"company_url_direct", "company_url_direct",
"company_addresses", "company_addresses",
"company_num_employees", "company_num_employees",
@@ -252,8 +245,6 @@ def scrape_jobs(
         jobs_df = jobs_df[desired_order]

         # Step 4: Sort the DataFrame as required
-        return jobs_df.sort_values(
-            by=["site", "date_posted"], ascending=[True, False]
-        ).reset_index(drop=True)
+        return jobs_df.sort_values(by=["site", "date_posted"], ascending=[True, False])
     else:
         return pd.DataFrame()


@@ -256,7 +256,7 @@ class JobPost(BaseModel):
     company_num_employees: str | None = None
     company_revenue: str | None = None
     company_description: str | None = None
-    company_logo: str | None = None
+    logo_photo_url: str | None = None
     banner_photo_url: str | None = None

     # linkedin only atm


@@ -17,19 +17,14 @@ class Site(Enum):
     INDEED = "indeed"
     ZIP_RECRUITER = "zip_recruiter"
     GLASSDOOR = "glassdoor"
-    GOOGLE = "google"
-    BAYT = "bayt"


 class SalarySource(Enum):
     DIRECT_DATA = "direct_data"
     DESCRIPTION = "description"


 class ScraperInput(BaseModel):
     site_type: list[Site]
     search_term: str | None = None
-    google_search_term: str | None = None

     location: str | None = None
     country: Country | None = Country.USA
@@ -47,9 +42,7 @@ class ScraperInput(BaseModel):
 class Scraper(ABC):
-    def __init__(
-        self, site: Site, proxies: list[str] | None = None, ca_cert: str | None = None
-    ):
+    def __init__(self, site: Site, proxies: list[str] | None = None, ca_cert: str | None = None):
         self.site = site
         self.proxies = proxies
         self.ca_cert = ca_cert
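Every board-specific scraper in this diff derives from this base class; here is a stripped-down hypothetical sketch of the contract, assuming the abstract `scrape` method implied by the subclasses below:

```py
class ExampleScraper(Scraper):
    """Hypothetical scraper showing the constructor/scrape contract."""

    def __init__(self, proxies: list[str] | None = None, ca_cert: str | None = None):
        super().__init__(Site.INDEED, proxies=proxies, ca_cert=ca_cert)

    def scrape(self, scraper_input: ScraperInput) -> JobResponse:
        # A real scraper would fetch and parse listings here.
        return JobResponse(jobs=[])
```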


@@ -1,145 +0,0 @@
"""
jobspy.scrapers.bayt
~~~~~~~~~~~~~~~~~~~
This module contains routines to scrape Bayt.
"""
from __future__ import annotations
import random
import time
from bs4 import BeautifulSoup
from .. import Scraper, ScraperInput, Site
from ..utils import create_logger, create_session
from ...jobs import JobPost, JobResponse, Location, Country
log = create_logger("Bayt")
class BaytScraper(Scraper):
base_url = "https://www.bayt.com"
delay = 2
band_delay = 3
def __init__(
self, proxies: list[str] | str | None = None, ca_cert: str | None = None
):
super().__init__(Site.BAYT, proxies=proxies, ca_cert=ca_cert)
self.scraper_input = None
self.session = None
self.country = "worldwide"
def scrape(self, scraper_input: ScraperInput) -> JobResponse:
self.scraper_input = scraper_input
self.session = create_session(
proxies=self.proxies, ca_cert=self.ca_cert, is_tls=False, has_retry=True
)
job_list: list[JobPost] = []
page = 1
results_wanted = (
scraper_input.results_wanted if scraper_input.results_wanted else 10
)
while len(job_list) < results_wanted:
log.info(f"Fetching Bayt jobs page {page}")
job_elements = self._fetch_jobs(self.scraper_input.search_term, page)
if not job_elements:
break
if job_elements:
log.debug(
"First job element snippet:\n" + job_elements[0].prettify()[:500]
)
initial_count = len(job_list)
for job in job_elements:
try:
job_post = self._extract_job_info(job)
if job_post:
job_list.append(job_post)
if len(job_list) >= results_wanted:
break
else:
log.debug(
"Extraction returned None. Job snippet:\n"
+ job.prettify()[:500]
)
except Exception as e:
log.error(f"Bayt: Error extracting job info: {str(e)}")
continue
if len(job_list) == initial_count:
log.info(f"No new jobs found on page {page}. Ending pagination.")
break
page += 1
time.sleep(random.uniform(self.delay, self.delay + self.band_delay))
job_list = job_list[: scraper_input.results_wanted]
return JobResponse(jobs=job_list)
def _fetch_jobs(self, query: str, page: int) -> list | None:
"""
Grabs the job results for the given query and page number.
"""
try:
url = f"{self.base_url}/en/international/jobs/{query}-jobs/?page={page}"
response = self.session.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
job_listings = soup.find_all("li", attrs={"data-js-job": ""})
log.debug(f"Found {len(job_listings)} job listing elements")
return job_listings
except Exception as e:
log.error(f"Bayt: Error fetching jobs - {str(e)}")
return None
def _extract_job_info(self, job: BeautifulSoup) -> JobPost | None:
"""
Extracts the job information from a single job listing.
"""
# Find the h2 element holding the title and link (no class filtering)
job_general_information = job.find("h2")
if not job_general_information:
return
job_title = job_general_information.get_text(strip=True)
job_url = self._extract_job_url(job_general_information)
if not job_url:
return
# Extract company name using the original approach:
company_tag = job.find("div", class_="t-nowrap p10l")
company_name = (
company_tag.find("span").get_text(strip=True)
if company_tag and company_tag.find("span")
else None
)
# Extract location using the original approach:
location_tag = job.find("div", class_="t-mute t-small")
location = location_tag.get_text(strip=True) if location_tag else None
job_id = f"bayt-{abs(hash(job_url))}"
location_obj = Location(
city=location,
country=Country.from_string(self.country),
)
return JobPost(
id=job_id,
title=job_title,
company_name=company_name,
location=location_obj,
job_url=job_url,
)
def _extract_job_url(self, job_general_information: BeautifulSoup) -> str | None:
"""
Pulls the job URL from the 'a' within the h2 element.
"""
a_tag = job_general_information.find("a")
if a_tag and a_tag.has_attr("href"):
return self.base_url + a_tag["href"].strip()
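As the removed README line noted, Bayt only honors `search_term` and searches internationally; a minimal hedged sketch of calling the left-hand scraper through the public API:

```py
from jobspy import scrape_jobs

# location/hours_old/etc. are ignored by the Bayt scraper.
jobs = scrape_jobs(site_name="bayt", search_term="python developer", results_wanted=10)
```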


@@ -24,13 +24,3 @@ class ZipRecruiterException(Exception):
 class GlassdoorException(Exception):
     def __init__(self, message=None):
         super().__init__(message or "An error occurred with Glassdoor")
-
-
-class GoogleJobsException(Exception):
-    def __init__(self, message=None):
-        super().__init__(message or "An error occurred with Google Jobs")
-
-
-class BaytException(Exception):
-    def __init__(self, message=None):
-        super().__init__(message or "An error occurred with Bayt")


@@ -32,7 +32,7 @@ from ...jobs import (
     DescriptionFormat,
 )

-log = create_logger("Glassdoor")
+logger = create_logger("Glassdoor")


 class GlassdoorScraper(Scraper):
@@ -64,7 +64,7 @@ class GlassdoorScraper(Scraper):
         self.base_url = self.scraper_input.country.get_glassdoor_url()
         self.session = create_session(
-            proxies=self.proxies, ca_cert=self.ca_cert, has_retry=True
+            proxies=self.proxies, ca_cert=self.ca_cert, is_tls=True, has_retry=True
         )
         token = self._get_csrf_token()
         headers["gd-csrf-token"] = token if token else fallback_token
@@ -74,7 +74,7 @@ class GlassdoorScraper(Scraper):
             scraper_input.location, scraper_input.is_remote
         )
         if location_type is None:
-            log.error("Glassdoor: location not parsed")
+            logger.error("Glassdoor: location not parsed")
             return JobResponse(jobs=[])
         job_list: list[JobPost] = []
         cursor = None
@@ -83,7 +83,7 @@ class GlassdoorScraper(Scraper):
         tot_pages = (scraper_input.results_wanted // self.jobs_per_page) + 2
         range_end = min(tot_pages, self.max_pages + 1)
         for page in range(range_start, range_end):
-            log.info(f"search page: {page} / {range_end - 1}")
+            logger.info(f"search page: {page} / {range_end-1}")
             try:
                 jobs, cursor = self._fetch_jobs_page(
                     scraper_input, location_id, location_type, page, cursor
@@ -93,7 +93,7 @@ class GlassdoorScraper(Scraper):
                     job_list = job_list[: scraper_input.results_wanted]
                     break
             except Exception as e:
-                log.error(f"Glassdoor: {str(e)}")
+                logger.error(f"Glassdoor: {str(e)}")
                 break
         return JobResponse(jobs=job_list)
@@ -129,7 +129,7 @@ class GlassdoorScraper(Scraper):
             ValueError,
             Exception,
         ) as e:
-            log.error(f"Glassdoor: {str(e)}")
+            logger.error(f"Glassdoor: {str(e)}")
             return jobs, None

         jobs_data = res_json["data"]["jobListings"]["jobListings"]
@@ -214,7 +214,7 @@ class GlassdoorScraper(Scraper):
             is_remote=is_remote,
             description=description,
             emails=extract_emails_from_text(description) if description else None,
-            company_logo=company_logo,
+            logo_photo_url=company_logo,
             listing_type=listing_type,
         )
@@ -264,12 +264,12 @@ class GlassdoorScraper(Scraper):
         if res.status_code != 200:
             if res.status_code == 429:
                 err = f"429 Response - Blocked by Glassdoor for too many requests"
-                log.error(err)
+                logger.error(err)
                 return None, None
             else:
                 err = f"Glassdoor response status code {res.status_code}"
                 err += f" - {res.text}"
-                log.error(f"Glassdoor response status code {res.status_code}")
+                logger.error(f"Glassdoor response status code {res.status_code}")
                 return None, None
         items = res.json()


@@ -1,247 +0,0 @@
"""
jobspy.scrapers.google
~~~~~~~~~~~~~~~~~~~
This module contains routines to scrape Google.
"""
from __future__ import annotations
import math
import re
import json
from typing import Tuple
from datetime import datetime, timedelta
from .constants import headers_jobs, headers_initial, async_param
from .. import Scraper, ScraperInput, Site
from ..utils import extract_emails_from_text, create_logger, extract_job_type
from ..utils import (
create_session,
)
from ...jobs import (
JobPost,
JobResponse,
Location,
JobType,
)
log = create_logger("Google")
class GoogleJobsScraper(Scraper):
def __init__(
self, proxies: list[str] | str | None = None, ca_cert: str | None = None
):
"""
Initializes Google Scraper with the Goodle jobs search url
"""
site = Site(Site.GOOGLE)
super().__init__(site, proxies=proxies, ca_cert=ca_cert)
self.country = None
self.session = None
self.scraper_input = None
self.jobs_per_page = 10
self.seen_urls = set()
self.url = "https://www.google.com/search"
self.jobs_url = "https://www.google.com/async/callback:550"
def scrape(self, scraper_input: ScraperInput) -> JobResponse:
"""
Scrapes Google for jobs with scraper_input criteria.
:param scraper_input: Information about job search criteria.
:return: JobResponse containing a list of jobs.
"""
self.scraper_input = scraper_input
self.scraper_input.results_wanted = min(900, scraper_input.results_wanted)
self.session = create_session(
proxies=self.proxies, ca_cert=self.ca_cert, is_tls=False, has_retry=True
)
forward_cursor, job_list = self._get_initial_cursor_and_jobs()
if forward_cursor is None:
log.warning(
"initial cursor not found, try changing your query or there was at most 10 results"
)
return JobResponse(jobs=job_list)
page = 1
while (
len(self.seen_urls) < scraper_input.results_wanted + scraper_input.offset
and forward_cursor
):
log.info(
f"search page: {page} / {math.ceil(scraper_input.results_wanted / self.jobs_per_page)}"
)
try:
jobs, forward_cursor = self._get_jobs_next_page(forward_cursor)
except Exception as e:
log.error(f"failed to get jobs on page: {page}, {e}")
break
if not jobs:
log.info(f"found no jobs on page: {page}")
break
job_list += jobs
page += 1
return JobResponse(
jobs=job_list[
scraper_input.offset : scraper_input.offset
+ scraper_input.results_wanted
]
)
def _get_initial_cursor_and_jobs(self) -> Tuple[str, list[JobPost]]:
"""Gets initial cursor and jobs to paginate through job listings"""
query = f"{self.scraper_input.search_term} jobs"
def get_time_range(hours_old):
if hours_old <= 24:
return "since yesterday"
elif hours_old <= 72:
return "in the last 3 days"
elif hours_old <= 168:
return "in the last week"
else:
return "in the last month"
job_type_mapping = {
JobType.FULL_TIME: "Full time",
JobType.PART_TIME: "Part time",
JobType.INTERNSHIP: "Internship",
JobType.CONTRACT: "Contract",
}
if self.scraper_input.job_type in job_type_mapping:
query += f" {job_type_mapping[self.scraper_input.job_type]}"
if self.scraper_input.location:
query += f" near {self.scraper_input.location}"
if self.scraper_input.hours_old:
time_filter = get_time_range(self.scraper_input.hours_old)
query += f" {time_filter}"
if self.scraper_input.is_remote:
query += " remote"
if self.scraper_input.google_search_term:
query = self.scraper_input.google_search_term
params = {"q": query, "udm": "8"}
response = self.session.get(self.url, headers=headers_initial, params=params)
pattern_fc = r'<div jsname="Yust4d"[^>]+data-async-fc="([^"]+)"'
match_fc = re.search(pattern_fc, response.text)
data_async_fc = match_fc.group(1) if match_fc else None
jobs_raw = self._find_job_info_initial_page(response.text)
jobs = []
for job_raw in jobs_raw:
job_post = self._parse_job(job_raw)
if job_post:
jobs.append(job_post)
return data_async_fc, jobs
def _get_jobs_next_page(self, forward_cursor: str) -> Tuple[list[JobPost], str]:
params = {"fc": [forward_cursor], "fcv": ["3"], "async": [async_param]}
response = self.session.get(self.jobs_url, headers=headers_jobs, params=params)
return self._parse_jobs(response.text)
def _parse_jobs(self, job_data: str) -> Tuple[list[JobPost], str]:
"""
Parses jobs on a page with next page cursor
"""
start_idx = job_data.find("[[[")
end_idx = job_data.rindex("]]]") + 3
s = job_data[start_idx:end_idx]
parsed = json.loads(s)[0]
pattern_fc = r'data-async-fc="([^"]+)"'
match_fc = re.search(pattern_fc, job_data)
data_async_fc = match_fc.group(1) if match_fc else None
jobs_on_page = []
for array in parsed:
_, job_data = array
if not job_data.startswith("[[["):
continue
job_d = json.loads(job_data)
job_info = self._find_job_info(job_d)
job_post = self._parse_job(job_info)
if job_post:
jobs_on_page.append(job_post)
return jobs_on_page, data_async_fc
def _parse_job(self, job_info: list):
job_url = job_info[3][0][0] if job_info[3] and job_info[3][0] else None
if job_url in self.seen_urls:
return
self.seen_urls.add(job_url)
title = job_info[0]
company_name = job_info[1]
location = city = job_info[2]
state = country = date_posted = None
if location and "," in location:
city, state, *country = [*map(lambda x: x.strip(), location.split(","))]
days_ago_str = job_info[12]
if type(days_ago_str) == str:
match = re.search(r"\d+", days_ago_str)
days_ago = int(match.group()) if match else None
date_posted = (datetime.now() - timedelta(days=days_ago)).date()
description = job_info[19]
job_post = JobPost(
id=f"go-{job_info[28]}",
title=title,
company_name=company_name,
location=Location(
city=city, state=state, country=country[0] if country else None
),
job_url=job_url,
date_posted=date_posted,
is_remote="remote" in description.lower() or "wfh" in description.lower(),
description=description,
emails=extract_emails_from_text(description),
job_type=extract_job_type(description),
)
return job_post
@staticmethod
def _find_job_info(jobs_data: list | dict) -> list | None:
"""Iterates through the JSON data to find the job listings"""
if isinstance(jobs_data, dict):
for key, value in jobs_data.items():
if key == "520084652" and isinstance(value, list):
return value
else:
result = GoogleJobsScraper._find_job_info(value)
if result:
return result
elif isinstance(jobs_data, list):
for item in jobs_data:
result = GoogleJobsScraper._find_job_info(item)
if result:
return result
return None
@staticmethod
def _find_job_info_initial_page(html_text: str):
pattern = f'520084652":(' + r"\[.*?\]\s*])\s*}\s*]\s*]\s*]\s*]\s*]"
results = []
matches = re.finditer(pattern, html_text)
import json
for match in matches:
try:
parsed_data = json.loads(match.group(1))
results.append(parsed_data)
except json.JSONDecodeError as e:
log.error(f"Failed to parse match: {str(e)}")
results.append({"raw_match": match.group(0), "error": str(e)})
return results


@@ -1,52 +0,0 @@
headers_initial = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "accept-language": "en-US,en;q=0.9",
    "priority": "u=0, i",
    "referer": "https://www.google.com/",
    "sec-ch-prefers-color-scheme": "dark",
    "sec-ch-ua": '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
    "sec-ch-ua-arch": '"arm"',
    "sec-ch-ua-bitness": '"64"',
    "sec-ch-ua-form-factors": '"Desktop"',
    "sec-ch-ua-full-version": '"130.0.6723.58"',
    "sec-ch-ua-full-version-list": '"Chromium";v="130.0.6723.58", "Google Chrome";v="130.0.6723.58", "Not?A_Brand";v="99.0.0.0"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-model": '""',
    "sec-ch-ua-platform": '"macOS"',
    "sec-ch-ua-platform-version": '"15.0.1"',
    "sec-ch-ua-wow64": "?0",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "same-origin",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
    "x-browser-channel": "stable",
    "x-browser-copyright": "Copyright 2024 Google LLC. All rights reserved.",
    "x-browser-year": "2024",
}

headers_jobs = {
    "accept": "*/*",
    "accept-language": "en-US,en;q=0.9",
    "priority": "u=1, i",
    "referer": "https://www.google.com/",
    "sec-ch-prefers-color-scheme": "dark",
    "sec-ch-ua": '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
    "sec-ch-ua-arch": '"arm"',
    "sec-ch-ua-bitness": '"64"',
    "sec-ch-ua-form-factors": '"Desktop"',
    "sec-ch-ua-full-version": '"130.0.6723.58"',
    "sec-ch-ua-full-version-list": '"Chromium";v="130.0.6723.58", "Google Chrome";v="130.0.6723.58", "Not?A_Brand";v="99.0.0.0"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-model": '""',
    "sec-ch-ua-platform": '"macOS"',
    "sec-ch-ua-platform-version": '"15.0.1"',
    "sec-ch-ua-wow64": "?0",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
}

async_param = "_basejs:/xjs/_/js/k=xjs.s.en_US.JwveA-JiKmg.2018.O/am=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAACAAAoICAAAAAAAKMAfAAAAIAQAAAAAAAAAAAAACCAAAEJDAAACAAAAAGABAIAAARBAAABAAAAAgAgQAABAASKAfv8JAAABAAAAAAwAQAQACQAAAAAAcAEAQABoCAAAABAAAIABAACAAAAEAAAAFAAAAAAAAAAAAAAAAAAAAAAAAACAQADoBwAAAAAAAAAAAAAQBAAAAATQAAoACOAHAAAAAAAAAQAAAIIAAAA_ZAACAAAAAAAAcB8APB4wHFJ4AAAAAAAAAAAAAAAACECCYA5If0EACAAAAAAAAAAAAAAAAAAAUgRNXG4AMAE/dg=0/br=1/rs=ACT90oGxMeaFMCopIHq5tuQM-6_3M_VMjQ,_basecss:/xjs/_/ss/k=xjs.s.IwsGu62EDtU.L.B1.O/am=QOoQIAQAAAQAREADEBAAAAAAAAAAAAAAAAAAAAAgAQAAIAAAgAQAAAIAIAIAoEwCAADIC8AfsgEAawwAPkAAjgoAGAAAAAAAAEADAAAAAAIgAECHAAAAAAAAAAABAQAggAARQAAAQCEAAAAAIAAAABgAAAAAIAQIACCAAfB-AAFIQABoCEA_CgEAAIABAACEgHAEwwAEFQAM4CgAAAAAAAAAAAAACABCAAAAQEAAABAgAMCPAAA4AoE2BAEAggSAAIoAQAAAAAgAAAAACCAQAAAxEwA_ZAACAAAAAAAAAAkAAAAAAAAgAAAAAAAAAAAAAAAAAAAAAAAAQAEAAAAAAAAAAAAAAAAAAAAAQA/br=1/rs=ACT90oGZc36t3uUQkj0srnIvvbHjO2hgyg,_basecomb:/xjs/_/js/k=xjs.s.en_US.JwveA-JiKmg.2018.O/ck=xjs.s.IwsGu62EDtU.L.B1.O/am=QOoQIAQAAAQAREADEBAAAAAAAAAAAAAAAAAAAAAgAQAAIAAAgAQAAAKAIAoIqEwCAADIK8AfsgEAawwAPkAAjgoAGAAACCAAAEJDAAACAAIgAGCHAIAAARBAAABBAQAggAgRQABAQSOAfv8JIAABABgAAAwAYAQICSCAAfB-cAFIQABoCEA_ChEAAIABAACEgHAEwwAEFQAM4CgAAAAAAAAAAAAACABCAACAQEDoBxAgAMCPAAA4AoE2BAEAggTQAIoASOAHAAgAAAAACSAQAIIxEwA_ZAACAAAAAAAAcB8APB4wHFJ4AAAAAAAAAAAAAAAACECCYA5If0EACAAAAAAAAAAAAAAAAAAAUgRNXG4AMAE/d=1/ed=1/dg=0/br=1/ujg=1/rs=ACT90oFNLTjPzD_OAqhhtXwe2pg1T3WpBg,_fmt:prog,_id:fc_5FwaZ86OKsfdwN4P4La3yA4_2"


@@ -30,7 +30,7 @@ from ...jobs import (
     DescriptionFormat,
 )

-log = create_logger("Indeed")
+logger = create_logger("Indeed")


 class IndeedScraper(Scraper):
@@ -69,23 +69,25 @@ class IndeedScraper(Scraper):
         page = 1
         cursor = None
+        offset_pages = math.ceil(self.scraper_input.offset / 100)
+        for _ in range(offset_pages):
+            logger.info(f"skipping search page: {page}")
+            __, cursor = self._scrape_page(cursor)
+            if not __:
+                logger.info(f"found no jobs on page: {page}")
+                break

-        while len(self.seen_urls) < scraper_input.results_wanted + scraper_input.offset:
-            log.info(
-                f"search page: {page} / {math.ceil(scraper_input.results_wanted / self.jobs_per_page)}"
+        while len(self.seen_urls) < scraper_input.results_wanted:
+            logger.info(
+                f"search page: {page} / {math.ceil(scraper_input.results_wanted / 100)}"
             )
             jobs, cursor = self._scrape_page(cursor)
             if not jobs:
-                log.info(f"found no jobs on page: {page}")
+                logger.info(f"found no jobs on page: {page}")
                 break
             job_list += jobs
             page += 1
-        return JobResponse(
-            jobs=job_list[
-                scraper_input.offset : scraper_input.offset
-                + scraper_input.results_wanted
-            ]
-        )
+        return JobResponse(jobs=job_list[: scraper_input.results_wanted])

     def _scrape_page(self, cursor: str | None) -> Tuple[list[JobPost], str | None]:
         """
@@ -122,10 +124,9 @@ class IndeedScraper(Scraper):
                 headers=api_headers_temp,
                 json=payload,
                 timeout=10,
-                verify=False,
             )
             if not response.ok:
-                log.info(
+                logger.info(
                     f"responded with status code: {response.status_code} (submit GitHub issue if this appears to be a bug)"
                 )
                 return jobs, new_cursor
@@ -259,7 +260,7 @@ class IndeedScraper(Scraper):
             company_num_employees=employer_details.get("employeesLocalizedLabel"),
             company_revenue=employer_details.get("revenueLocalizedLabel"),
             company_description=employer_details.get("briefDescription"),
-            company_logo=(
+            logo_photo_url=(
                 employer["images"].get("squareLogoUrl")
                 if employer and employer.get("images")
                 else None


@@ -38,7 +38,7 @@ from ..utils import (
     markdown_converter,
 )

-log = create_logger("LinkedIn")
+logger = create_logger("LinkedIn")


 class LinkedInScraper(Scraper):
@@ -86,7 +86,7 @@ class LinkedInScraper(Scraper):
         )
         while continue_search():
             request_count += 1
-            log.info(
+            logger.info(
                 f"search page: {request_count} / {math.ceil(scraper_input.results_wanted / 10)}"
             )
             params = {
@@ -126,13 +126,13 @@ class LinkedInScraper(Scraper):
                 else:
                     err = f"LinkedIn response status code {response.status_code}"
                     err += f" - {response.text}"
-                    log.error(err)
+                    logger.error(err)
                     return JobResponse(jobs=job_list)
             except Exception as e:
                 if "Proxy responded with" in str(e):
-                    log.error(f"LinkedIn: Bad proxy")
+                    logger.error(f"LinkedIn: Bad proxy")
                 else:
-                    log.error(f"LinkedIn: {str(e)}")
+                    logger.error(f"LinkedIn: {str(e)}")
                 return JobResponse(jobs=job_list)

             soup = BeautifulSoup(response.text, "html.parser")
@@ -232,7 +232,7 @@ class LinkedInScraper(Scraper):
             description=job_details.get("description"),
             job_url_direct=job_details.get("job_url_direct"),
             emails=extract_emails_from_text(job_details.get("description")),
-            company_logo=job_details.get("company_logo"),
+            logo_photo_url=job_details.get("logo_photo_url"),
             job_function=job_details.get("job_function"),
         )
@@ -275,7 +275,7 @@ class LinkedInScraper(Scraper):
         if job_function_span:
             job_function = job_function_span.text.strip()

-        company_logo = (
+        logo_photo_url = (
             logo_image.get("data-delayed-url")
             if (logo_image := soup.find("img", {"class": "artdeco-entity-image"}))
             else None
@@ -286,7 +286,7 @@ class LinkedInScraper(Scraper):
"company_industry": self._parse_company_industry(soup), "company_industry": self._parse_company_industry(soup),
"job_type": self._parse_job_type(soup), "job_type": self._parse_job_type(soup),
"job_url_direct": self._parse_job_url_direct(soup), "job_url_direct": self._parse_job_url_direct(soup),
"company_logo": company_logo, "logo_photo_url": logo_photo_url,
"job_function": job_function, "job_function": job_function,
} }


@@ -1,20 +1,17 @@
 from __future__ import annotations

-import logging
 import re
+import logging
 from itertools import cycle

-import numpy as np
 import requests
 import tls_client
-import urllib3
+import numpy as np
 from markdownify import markdownify as md
 from requests.adapters import HTTPAdapter, Retry

 from ..jobs import CompensationInterval, JobType

-urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


 def create_logger(name: str):
     logger = logging.getLogger(f"JobSpy:{name}")
@@ -132,7 +129,7 @@ def create_session(
     return session


-def set_logger_level(verbose: int):
+def set_logger_level(verbose: int = 2):
     """
     Adjusts the logger's level. This function allows the logging level to be changed at runtime.
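The default change (0 on the left, 2 on the right) mostly affects console noise; a sketch assuming the documented mapping of 0 = errors only up to 2 = all logs:

```py
from jobspy import scrape_jobs

# verbose=0 keeps only errors; the right-hand default of 2 logs everything.
jobs = scrape_jobs(site_name="indeed", search_term="engineer", verbose=0)
```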
@@ -267,22 +264,3 @@ def extract_salary(
     else:
         return interval, min_salary, max_salary, "USD"
     return None, None, None, None
-
-
-def extract_job_type(description: str):
-    if not description:
-        return []
-
-    keywords = {
-        JobType.FULL_TIME: r"full\s?time",
-        JobType.PART_TIME: r"part\s?time",
-        JobType.INTERNSHIP: r"internship",
-        JobType.CONTRACT: r"contract",
-    }
-
-    listing_types = []
-    for key, pattern in keywords.items():
-        if re.search(pattern, description, re.IGNORECASE):
-            listing_types.append(key)
-
-    return listing_types if listing_types else None
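The removed helper scans a description against each regex and can return several matches; a usage sketch against the code above (note `full\s?time` matches "full time"/"fulltime" but not the hyphenated form):

```py
types = extract_job_type("Hiring a full time engineer; contract work possible")
# -> [JobType.FULL_TIME, JobType.CONTRACT]; returns None when nothing matches.
```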


@@ -11,10 +11,11 @@ import json
 import math
 import re
 import time
-from concurrent.futures import ThreadPoolExecutor
 from datetime import datetime
 from typing import Optional, Tuple, Any
+from concurrent.futures import ThreadPoolExecutor

 from bs4 import BeautifulSoup

 from .constants import headers
@@ -36,7 +37,7 @@ from ...jobs import (
     DescriptionFormat,
 )

-log = create_logger("ZipRecruiter")
+logger = create_logger("ZipRecruiter")


 class ZipRecruiterScraper(Scraper):
@@ -76,7 +77,7 @@ class ZipRecruiterScraper(Scraper):
                 break
             if page > 1:
                 time.sleep(self.delay)
-            log.info(f"search page: {page} / {max_pages}")
+            logger.info(f"search page: {page} / {max_pages}")
             jobs_on_page, continue_token = self._find_jobs_in_page(
                 scraper_input, continue_token
             )
@@ -109,13 +110,13 @@ class ZipRecruiterScraper(Scraper):
             else:
                 err = f"ZipRecruiter response status code {res.status_code}"
                 err += f" with response: {res.text}"  # ZipRecruiter likely not available in EU
-                log.error(err)
+                logger.error(err)
             return jobs_list, ""
         except Exception as e:
             if "Proxy responded with" in str(e):
-                log.error(f"Indeed: Bad proxy")
+                logger.error(f"Indeed: Bad proxy")
             else:
-                log.error(f"Indeed: {str(e)}")
+                logger.error(f"Indeed: {str(e)}")
             return jobs_list, ""

         res_data = res.json()
@@ -214,28 +215,7 @@ class ZipRecruiterScraper(Scraper):
         return description_full, job_url_direct

     def _get_cookies(self):
-        """
-        Sends a session event to the API with device properties.
-        """
-        data = [
-            ("event_type", "session"),
-            ("logged_in", "false"),
-            ("number_of_retry", "1"),
-            ("property", "model:iPhone"),
-            ("property", "os:iOS"),
-            ("property", "locale:en_us"),
-            ("property", "app_build_number:4734"),
-            ("property", "app_version:91.0"),
-            ("property", "manufacturer:Apple"),
-            ("property", "timestamp:2025-01-12T12:04:42-06:00"),
-            ("property", "screen_height:852"),
-            ("property", "os_version:16.6.1"),
-            ("property", "source:install"),
-            ("property", "screen_width:393"),
-            ("property", "device_model:iPhone 14 Pro"),
-            ("property", "brand:Apple"),
-        ]
+        data = "event_type=session&logged_in=false&number_of_retry=1&property=model%3AiPhone&property=os%3AiOS&property=locale%3Aen_us&property=app_build_number%3A4734&property=app_version%3A91.0&property=manufacturer%3AApple&property=timestamp%3A2024-01-12T12%3A04%3A42-06%3A00&property=screen_height%3A852&property=os_version%3A16.6.1&property=source%3Ainstall&property=screen_width%3A393&property=device_model%3AiPhone%2014%20Pro&property=brand%3AApple"
         url = f"{self.api_url}/jobs-app/event"
         self.session.post(url, data=data)
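The right side inlines the pre-encoded form body that the left built as a list of pairs; for reference, a sketch of how such a string can be produced (values here are a shortened subset):

```py
from urllib.parse import urlencode

data = [("event_type", "session"), ("property", "model:iPhone"), ("property", "os:iOS")]
print(urlencode(data))  # event_type=session&property=model%3AiPhone&property=os%3AiOS
```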

0 tests/__init__.py Normal file

18 tests/test_all.py Normal file

@@ -0,0 +1,18 @@
from jobspy import scrape_jobs
import pandas as pd


def test_all():
    sites = [
        "indeed",
        "glassdoor",
    ]  # ziprecruiter/linkedin needs good ip, and temp fix to pass test on ci

    result = scrape_jobs(
        site_name=sites,
        search_term="engineer",
        results_wanted=5,
    )

    assert (
        isinstance(result, pd.DataFrame) and len(result) == len(sites) * 5
    ), "Result should be a non-empty DataFrame"

13 tests/test_glassdoor.py Normal file

@@ -0,0 +1,13 @@
from jobspy import scrape_jobs
import pandas as pd


def test_glassdoor():
    result = scrape_jobs(
        site_name="glassdoor",
        search_term="engineer",
        results_wanted=5,
    )
    assert (
        isinstance(result, pd.DataFrame) and len(result) == 5
    ), "Result should be a non-empty DataFrame"

13 tests/test_indeed.py Normal file

@@ -0,0 +1,13 @@
from jobspy import scrape_jobs
import pandas as pd


def test_indeed():
    result = scrape_jobs(
        site_name="indeed",
        search_term="engineer",
        results_wanted=5,
    )
    assert (
        isinstance(result, pd.DataFrame) and len(result) == 5
    ), "Result should be a non-empty DataFrame"

9 tests/test_linkedin.py Normal file

@@ -0,0 +1,9 @@
from jobspy import scrape_jobs
import pandas as pd


def test_linkedin():
    result = scrape_jobs(site_name="linkedin", search_term="engineer", results_wanted=5)
    assert (
        isinstance(result, pd.DataFrame) and len(result) == 5
    ), "Result should be a non-empty DataFrame"


@@ -0,0 +1,12 @@
from jobspy import scrape_jobs
import pandas as pd


def test_ziprecruiter():
    result = scrape_jobs(
        site_name="zip_recruiter", search_term="software engineer", results_wanted=5
    )

    assert (
        isinstance(result, pd.DataFrame) and len(result) == 5
    ), "Result should be a non-empty DataFrame"