Compare commits

79 Commits

Author SHA1 Message Date
Piotr Geca 94d413bad1
support for socks5 proxies (#266)
Co-authored-by: Piotr Geca <piotr.geca@npl.co.uk>
2025-04-10 15:53:28 -05:00
Cullen Watson 61205bcc77
chore: version 2025-03-27 21:59:47 -05:00
Nikhil Sasi f1602eca70
Fix date parsing error: prevent negative days by using timedelta (#264)
subtracting extracted "days" from label with current day causes negative days
datetime class rejects negative day association
Use timedelta for proper date limitation

Co-authored-by: NIKHIL S <nikhil_s@nikhilMac.local>
2025-03-27 21:58:42 -05:00
Cullen Watson d4d52d05f5 chore:version 2025-03-21 17:35:23 -05:00
Liju Thomas 0946cb3373
feat: add naukri.com support (#259) 2025-03-21 17:23:07 -05:00
prudvisorra-aifa 051981689f
Update util.py (#256) 2025-03-17 11:51:19 -05:00
Cullen Watson 903b7e6f1b fix(linkedin):is remote 2025-03-06 13:38:28 -06:00
Cullen Watson 6782b9884e fix:workflow 2025-03-01 14:49:31 -06:00
Cullen Watson 94c74d60f2
enh:workflow manual run 2025-03-01 14:47:24 -06:00
Cullen Watson 5463e5a664 chore:version 2025-03-01 14:38:25 -06:00
arkhy ed139e7e6b
added missing EU countries and languages (#250)
Co-authored-by: Kate Arkhangelskaya <ekar559e@tu-dresden.de>
2025-03-01 14:30:08 -06:00
Cullen Watson 5bd199d0a5 Merge branch 'main' of https://github.com/Bunsly/JobSpy 2025-02-21 14:15:06 -06:00
Cullen Watson 4ec308a302 refactor:organize code 2025-02-21 14:14:55 -06:00
Cullen Watson 7cb0c518fc
docs:readme 2025-02-21 12:53:59 -06:00
Cullen Watson df70d4bc2e minor 2025-02-21 12:35:31 -06:00
Cullen Watson 3006063875 enh:remove log by default 2025-02-21 12:31:04 -06:00
Abdulrahman Hisham 1be009b8bc
Adding Bayt.com Scraper to current codebase (#246) 2025-02-21 12:29:54 -06:00
Cullen Watson 81ed9b3ddf enh:remove log by default 2025-02-21 12:29:28 -06:00
Abdulrahman Al Muaitah 11a9e9a56a Fixed Bayt scraper integration 2025-02-21 20:10:02 +04:00
Abdulrahman Al Muaitah c6ade14784 Added Bayt Scraper integration 2025-02-21 15:31:29 +04:00
Cullen Watson 13c74a0fed
docs:readme 2025-02-09 13:42:18 -06:00
Cullen Watson 333e9e6760
docs:readme 2025-01-17 21:44:49 -06:00
github-actions 04032a0f91 Increment version 2024-12-04 22:55:06 +00:00
Cullen Watson 496896d0b5
enh:fix yml (#225) 2024-12-04 16:54:52 -06:00
Cullen Watson 87ba1ad1bf
fix yml 2024-12-04 16:52:15 -06:00
Jason Geffner 4e7ac9a583
Fix Google job search (#223)
The previous regex did not capture all expected matches in the returned content
2024-12-04 16:45:59 -06:00
Cullen Watson e44d13e1cf enh:auto update version 2024-12-04 16:29:38 -06:00
Cullen Watson d52e366ef7
docs:readme 2024-11-26 15:51:26 -06:00
Cullen Watson 395ebf0017
docs:readme 2024-11-26 15:49:12 -06:00
Cullen Watson 63fddd9b7f
docs:readme 2024-11-26 15:48:22 -06:00
Cullen Watson 58956868ae
docs:readme 2024-11-26 15:47:10 -06:00
Cullen Watson 4fce836222
docs:readme 2024-10-28 03:53:59 -05:00
Cullen Watson 5ba25e7a7c
docs:readme 2024-10-28 03:42:19 -05:00
Cullen Watson f7cb3e9206
docs:readme 2024-10-28 03:36:21 -05:00
Cullen Watson 3ad3f121f7
docs:readme 2024-10-28 03:34:52 -05:00
Cullen Watson ff3c782912
docs:readme 2024-10-25 18:12:08 -05:00
Cullen Watson 338d854b96
fix(google): search (#216) 2024-10-25 14:54:14 -05:00
Cullen Watson 811d4c40b4 chore:version 2024-10-24 15:28:25 -05:00
Cullen Watson dba92d22c2 chore:version 2024-10-24 15:27:16 -05:00
Cullen Watson 10a3592a0f docs:file 2024-10-24 15:26:49 -05:00
Cullen Watson b7905cc756 docs:file 2024-10-24 15:24:18 -05:00
Cullen Watson 6867d58829 docs:readme 2024-10-24 15:22:31 -05:00
Cullen Watson f6248c8386
enh: google jobs (#214) 2024-10-24 15:19:40 -05:00
Cullen Watson f395597fdd fix(indeed): offset 2024-10-22 19:25:07 -05:00
Cullen Watson 6372e41bd9
chore:version 2024-10-20 00:19:31 -05:00
Olzhas Arystanov 6c869decb8
build(deps): bump markdownify to 0.13.1 (#211) 2024-10-20 00:18:44 -05:00
Cullen Watson 9f4083380d
indeed:remove tpe (#210) 2024-10-19 18:01:59 -05:00
Olzhas Arystanov 9207ab56f6
fix: extract tests out of src (#209) 2024-10-19 16:56:38 -05:00
Cullen Watson 757a94853e chore:version 2024-10-08 17:49:06 -05:00
Marcel Gozalbo Baró 6bc191d5c7
FEATURE: Add the "ca_cert" setting for providing a Certification Authority certificate in order to use proxies requiring it. (#204) 2024-10-08 17:46:46 -05:00
Cullen Watson 0cc34287f7 fix:turkey 2024-10-02 01:31:00 -05:00
Anton Pikhteryev 923979093b
Add Malta for linkedin country support (#198) 2024-09-19 20:41:22 -05:00
Cullen Watson 286f0e4487
docs:readme 2024-09-18 18:49:41 -05:00
Cullen Watson f7b29d43a2
fix(indeed):sort relevance not date (#197) 2024-09-18 18:42:25 -05:00
Cullen Watson 6f1490458c
fix key error (#186) 2024-08-14 02:54:40 -05:00
Cullen Watson 6bb7d81ba8
change linkedin ep (#185) 2024-08-14 02:39:43 -05:00
Cullen Watson 0e046432d1
fix:variable bug (#181) 2024-08-05 12:47:55 -05:00
Cullen Watson 209e0e65b6
fix:malaysia indeed (#180) 2024-08-03 22:48:53 -05:00
Cullen Watson 8570c0651e
fix:key error (#176) 2024-07-21 13:05:18 -05:00
Cullen Watson 8678b0bbe4
enh: test on pr (#174) 2024-07-19 14:25:25 -05:00
Cullen Watson 60d4d911c9
lock file (#173) 2024-07-17 21:21:22 -05:00
Lluís Salord Quetglas 2a0cba8c7e
FEAT: Optional convertion to annual and know salary source (#170) 2024-07-17 21:05:33 -05:00
Mason DePalma de70189fa2
Update pyproject.toml (#172)
Changed Numpy to the most recent version so the package can properly install
2024-07-17 20:54:08 -05:00
Cullen Watson b55c0eb86d docs:readme 2024-07-16 19:24:38 -05:00
Cullen Watson 88c95c4ad5
enh: estimated salary (#169) 2024-07-16 19:20:34 -05:00
Cullen Watson d8d33d602f
docs: readme 2024-07-15 21:30:11 -05:00
Cullen Watson 6330c14879 minor fix 2024-07-15 21:19:01 -05:00
Ali Bakhshi Ilani 48631ea271
Add company industry and job level to linkedin scraper (#166) 2024-07-15 21:07:39 -05:00
Cullen Watson edffe18e65
enh: listing source (#168) 2024-07-15 20:30:04 -05:00
Lluís Salord Quetglas 0988230a24
FEAT: Add Glassdoor logo data if available (#167) 2024-07-15 20:25:18 -05:00
Cullen Watson d000a81eb3
Salary parse (#163) 2024-06-09 17:45:38 -05:00
Cullen Watson ccb0c17660
enh: ziprecruiter full description (#162) 2024-06-09 16:21:01 -05:00
Cullen Watson df339610fa
docs: readme 2024-05-29 19:32:32 -05:00
Cullen Watson c501006bd8
docs: readme 2024-05-28 16:04:26 -05:00
Cullen Watson 89a3ee231c
enh(li): job function (#160) 2024-05-28 16:01:29 -05:00
Cullen 6439f71433 chore: version 2024-05-28 15:39:24 -05:00
adamagassi 7f6271b2e0
LinkedIn scraper fixes: (#159)
Correct initial page offset calculation
Separate page variable from request counter
Fix job offset starting value
Increment offset by number of jobs returned instead of expected value
2024-05-28 15:38:13 -05:00
Cullen Watson 5cb7ffe5fd
enh: proxies (#157)
* enh: proxies

* enh: proxies
2024-05-25 14:04:09 -05:00
Cullen Watson cd29f79796
docs: readme 2024-05-25 11:46:23 -05:00
41 changed files with 3935 additions and 2923 deletions
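
The date-parsing fix in f1602eca70 (#264) can be illustrated with a minimal sketch. The helper name `posted_date_from_days_ago` is hypothetical and does not come from the repository; it only shows why subtracting a "days ago" count from the current day-of-month fails, and how `timedelta` avoids that.

```python
from datetime import date, timedelta


def posted_date_from_days_ago(days_ago: int, today: date) -> date:
    """Return the calendar date `days_ago` days before `today`.

    Hypothetical helper for illustration; not the repository's code.
    """
    # The naive form, date(today.year, today.month, today.day - days_ago),
    # raises ValueError whenever days_ago >= today.day, because the day
    # component would be zero or negative.
    return today - timedelta(days=days_ago)


# Five days before 3 April 2025 crosses a month boundary.
print(posted_date_from_days_ago(5, date(2025, 4, 3)))  # 2025-03-29
```

Subtracting a `timedelta` lets the standard library handle month and year rollover, so no manual clamping of the day value is needed.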


@ -1,9 +1,13 @@
name: Publish Python 🐍 distributions 📦 to PyPI
on: push
name: Publish JobSpy to PyPi
on:
push:
branches:
- main
workflow_dispatch:
jobs:
build-n-publish:
name: Build and publish Python 🐍 distributions 📦 to PyPI
name: Build and publish JobSpy to PyPi
runs-on: ubuntu-latest
steps:
@ -27,7 +31,7 @@ jobs:
build
- name: Publish distribution 📦 to PyPI
if: startsWith(github.ref, 'refs/tags')
if: startsWith(github.ref, 'refs/tags') || github.event_name == 'workflow_dispatch'
uses: pypa/gh-action-pypi-publish@release/v1
with:
password: ${{ secrets.PYPI_API_TOKEN }}

README.md

@ -1,20 +1,12 @@
<img src="https://github.com/cullenwatson/JobSpy/assets/78247585/ae185b7e-e444-4712-8bb9-fa97f53e896b" width="400">
**JobSpy** is a simple, yet comprehensive, job scraping library.
**Not technical?** Try out the web scraping tool on our site at [usejobspy.com](https://usejobspy.com).
*Looking to build a data-focused software product?* **[Book a call](https://bunsly.com/)** *to
work with us.*
**JobSpy** is a job scraping library with the goal of aggregating all the jobs from popular job boards with one tool.
## Features
- Scrapes job postings from **LinkedIn**, **Indeed**, **Glassdoor**, & **ZipRecruiter** simultaneously
- Aggregates the job postings in a Pandas DataFrame
- Proxy support
[Video Guide for JobSpy](https://www.youtube.com/watch?v=RuP1HrAZnxs&pp=ygUgam9icyBzY3JhcGVyIGJvdCBsaW5rZWRpbiBpbmRlZWQ%3D) -
Updated for release v1.1.3
- Scrapes job postings from **LinkedIn**, **Indeed**, **Glassdoor**, **Google**, **ZipRecruiter**, **Bayt** & **Naukri** concurrently
- Aggregates the job postings in a dataframe
- Proxy support to bypass blocking
![jobspy](https://github.com/cullenwatson/JobSpy/assets/78247585/ec7ef355-05f6-4fd3-8161-a817e31c5c57)
@ -33,17 +25,20 @@ import csv
from jobspy import scrape_jobs
jobs = scrape_jobs(
site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"],
site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor", "google", "bayt", "naukri"],
search_term="software engineer",
location="Dallas, TX",
google_search_term="software engineer jobs near San Francisco, CA since yesterday",
location="San Francisco, CA",
results_wanted=20,
hours_old=72, # (only Linkedin/Indeed is hour specific, others round up to days old)
country_indeed='USA', # only needed for indeed / glassdoor
# linkedin_fetch_description=True # get full description and direct job url for linkedin (slower)
hours_old=72,
country_indeed='USA',
# linkedin_fetch_description=True # gets more info such as description, direct job url (slower)
# proxies=["208.195.175.46:65095", "208.195.175.45:65095", "localhost"],
)
print(f"Found {len(jobs)} jobs")
print(jobs.head())
jobs.to_csv("jobs.csv", quoting=csv.QUOTE_NONNUMERIC, escapechar="\\", index=False) # to_xlsx
jobs.to_csv("jobs.csv", quoting=csv.QUOTE_NONNUMERIC, escapechar="\\", index=False) # to_excel
```
### Output
@ -56,65 +51,83 @@ linkedin Software Engineer - Early Career Lockheed Martin Sunnyvale
linkedin Full-Stack Software Engineer Rain New York NY fulltime yearly None None https://www.linkedin.com/jobs/view/3696158877 Rains mission is to create the fastest and ea...
zip_recruiter Software Engineer - New Grad ZipRecruiter Santa Monica CA fulltime yearly 130000 150000 https://www.ziprecruiter.com/jobs/ziprecruiter... We offer a hybrid work environment. Most US-ba...
zip_recruiter Software Developer TEKsystems Phoenix AZ fulltime hourly 65 75 https://www.ziprecruiter.com/jobs/teksystems-0... Top Skills' Details• 6 years of Java developme...
```
### Parameters for `scrape_jobs()`
```plaintext
Optional
├── site_name (list|str): linkedin, zip_recruiter, indeed, glassdoor (default is all four)
├── site_name (list|str):
| linkedin, zip_recruiter, indeed, glassdoor, google, bayt
| (default is all)
├── search_term (str)
|
├── google_search_term (str)
| search term for google jobs. This is the only param for filtering google jobs.
├── location (str)
├── distance (int): in miles, default 50
├── job_type (str): fulltime, parttime, internship, contract
├── proxy (str): in format 'http://user:pass@host:port'
├── distance (int):
| in miles, default 50
├── job_type (str):
| fulltime, parttime, internship, contract
├── proxies (list):
| in format ['user:pass@host:port', 'localhost']
| each job board scraper will round robin through the proxies
|
├── is_remote (bool)
├── results_wanted (int): number of job results to retrieve for each site specified in 'site_name'
├── easy_apply (bool): filters for jobs that are hosted on the job board site (LinkedIn & Indeed do not allow pairing this with hours_old)
├── linkedin_fetch_description (bool): fetches full description and direct job url for LinkedIn (slower)
├── linkedin_company_ids (list[int]): searches for linkedin jobs with specific company ids
├── description_format (str): markdown, html (Format type of the job descriptions. Default is markdown.)
├── country_indeed (str): filters the country on Indeed (see below for correct spelling)
├── offset (int): starts the search from an offset (e.g. 25 will start the search from the 25th result)
├── hours_old (int): filters jobs by the number of hours since the job was posted (ZipRecruiter and Glassdoor round up to next day. If you use this on Indeed, it will not filter by job_type/is_remote/easy_apply)
├── verbose (int) {0, 1, 2}: Controls the verbosity of the runtime printouts (0 prints only errors, 1 is errors+warnings, 2 is all logs. Default is 2.)
├── hyperlinks (bool): Whether to turn `job_url`s into hyperlinks. Default is false.
├── results_wanted (int):
| number of job results to retrieve for each site specified in 'site_name'
├── easy_apply (bool):
| filters for jobs that are hosted on the job board site (LinkedIn easy apply filter no longer works)
├── description_format (str):
| markdown, html (Format type of the job descriptions. Default is markdown.)
├── offset (int):
| starts the search from an offset (e.g. 25 will start the search from the 25th result)
├── hours_old (int):
| filters jobs by the number of hours since the job was posted
| (ZipRecruiter and Glassdoor round up to next day.)
├── verbose (int) {0, 1, 2}:
| Controls the verbosity of the runtime printouts
| (0 prints only errors, 1 is errors+warnings, 2 is all logs. Default is 2.)
├── linkedin_fetch_description (bool):
| fetches full description and direct job url for LinkedIn (Increases requests by O(n))
├── linkedin_company_ids (list[int]):
| searches for linkedin jobs with specific company ids
|
├── country_indeed (str):
| filters the country on Indeed & Glassdoor (see below for correct spelling)
|
├── enforce_annual_salary (bool):
| converts wages to annual salary
|
├── ca_cert (str)
| path to CA Certificate file for proxies
```
### JobPost Schema
```plaintext
JobPost
├── title (str)
├── company (str)
├── company_url (str)
├── job_url (str)
├── location (object)
│ ├── country (str)
│ ├── city (str)
│ ├── state (str)
├── description (str)
├── job_type (str): fulltime, parttime, internship, contract
├── compensation (object)
│ ├── interval (str): yearly, monthly, weekly, daily, hourly
│ ├── min_amount (int)
│ ├── max_amount (int)
│ └── currency (enum)
└── date_posted (date)
└── emails (str)
└── is_remote (bool)
Indeed specific
├── company_country (str)
└── company_addresses (str)
└── company_industry (str)
└── company_employees_label (str)
└── company_revenue_label (str)
└── company_description (str)
└── ceo_name (str)
└── ceo_photo_url (str)
└── logo_photo_url (str)
└── banner_photo_url (str)
```
├── Indeed limitations:
| Only one from this list can be used in a search:
| - hours_old
| - job_type & is_remote
| - easy_apply
└── LinkedIn limitations:
| Only one from this list can be used in a search:
| - hours_old
| - easy_apply
```
## Supported Countries for Job Searching
@ -153,26 +166,92 @@ You can specify the following countries when searching on Indeed (use the exact
| United Arab Emirates | UK* | USA* | Uruguay |
| Venezuela | Vietnam* | | |
### **Bayt**
Bayt only uses the search_term parameter currently and searches internationally
## Notes
* Indeed is the best scraper currently with no rate limiting.
* All the job board endpoints are capped at around 1000 jobs on a given search.
* LinkedIn is the most restrictive and usually rate limits around the 10th page.
* LinkedIn is the most restrictive and usually rate limits around the 10th page with one IP. Proxies are effectively required.
## Frequently Asked Questions
---
**Q: Why is Indeed giving unrelated roles?**
**A:** Indeed searches the description too.
**Q: Encountering issues with your queries?**
**A:** Try reducing the number of `results_wanted` and/or broadening the filters. If problems
persist, [submit an issue](https://github.com/Bunsly/JobSpy/issues).
- Use `-` to exclude words
- Use `""` for an exact match
Example of a good Indeed query
```py
search_term='"engineering intern" software summer (java OR python OR c++) 2025 -tax -marketing'
```
This searches the title and description. Results must include software, summer, and 2025, at least one of the listed languages, and the exact phrase "engineering intern", and must not contain tax or marketing.
---
**Q: No results when using "google"?**
**A:** Google requires very specific syntax. Search for Google jobs in your browser, apply your filters, then copy whatever appears in the Google Jobs search box into `google_search_term`.
---
**Q: Received a response code 429?**
**A:** This indicates that you have been blocked by the job board site for sending too many requests. All of the job board sites are aggressive with blocking. We recommend:
- Waiting some time between scrapes (site-dependent).
- Trying a VPN or proxy to change your IP address.
- Wait some time between scrapes (site-dependent).
- Try using the proxies param to change your IP address.
---
### JobPost Schema
```plaintext
JobPost
├── title
├── company
├── company_url
├── job_url
├── location
│ ├── country
│ ├── city
│ ├── state
├── is_remote
├── description
├── job_type: fulltime, parttime, internship, contract
├── job_function
│ ├── interval: yearly, monthly, weekly, daily, hourly
│ ├── min_amount
│ ├── max_amount
│ ├── currency
│ └── salary_source: direct_data, description (parsed from posting)
├── date_posted
└── emails
Linkedin specific
└── job_level
Linkedin & Indeed specific
└── company_industry
Indeed specific
├── company_country
├── company_addresses
├── company_employees_label
├── company_revenue_label
├── company_description
└── company_logo
Naukri specific
├── skills
├── experience_range
├── company_rating
├── company_reviews_count
├── vacancy_count
└── work_from_home_type
```
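
The README text above says each job board scraper will round robin through the `proxies` list. The diff does not show how JobSpy implements that rotation, so the following is only a sketch of the idea using `itertools.cycle`; the function name `proxy_rotation` and the host names are made up for illustration.

```python
from __future__ import annotations

from itertools import cycle
from typing import Iterator


def proxy_rotation(proxies: list[str] | str | None) -> Iterator[str | None]:
    """Return an endless iterator over the proxies in round-robin order.

    Hypothetical helper; when no proxies are given it yields None forever.
    """
    if not proxies:
        return cycle([None])
    if isinstance(proxies, str):
        # Accept a single proxy string as well as a list, like the new signature.
        proxies = [proxies]
    return cycle(proxies)


rotation = proxy_rotation(["user:pass@host1:8080", "user:pass@host2:8080", "localhost"])
# Four requests against three proxies: the fourth wraps around to the first.
first_four = [next(rotation) for _ in range(4)]
print(first_four)
```

Each call to `next(rotation)` would pick the proxy for the next outgoing request, so load is spread evenly without tracking an index by hand.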


@ -1,30 +0,0 @@
from jobspy import scrape_jobs
import pandas as pd
jobs: pd.DataFrame = scrape_jobs(
site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"],
search_term="software engineer",
location="Dallas, TX",
results_wanted=25, # the higher this is, the more likely you'll get blocked (a rotating proxy can help)
country_indeed="USA",
# proxy="http://jobspy:5a4vpWtj8EeJ2hoYzk@ca.smartproxy.com:20001",
)
# formatting for pandas
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", None)
pd.set_option("display.max_colwidth", 50) # set to 0 to see full job url / desc
# 1: output to console
print(jobs)
# 2: output to .csv
jobs.to_csv("./jobs.csv", index=False)
print("outputted to jobs.csv")
# 3: output to .xlsx
# jobs.to_excel('jobs.xlsx', index=False)
# 4: display in Jupyter Notebook (1. pip install jupyter 2. jupyter notebook)
# display(jobs)


@ -1,167 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "00a94b47-f47b-420f-ba7e-714ef219c006",
"metadata": {},
"outputs": [],
"source": [
"from jobspy import scrape_jobs\n",
"import pandas as pd\n",
"from IPython.display import display, HTML"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9f773e6c-d9fc-42cc-b0ef-63b739e78435",
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', None)\n",
"pd.set_option('display.width', None)\n",
"pd.set_option('display.max_colwidth', 50)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1253c1f8-9437-492e-9dd3-e7fe51099420",
"metadata": {},
"outputs": [],
"source": [
"# example 1 (no hyperlinks, USA)\n",
"jobs = scrape_jobs(\n",
" site_name=[\"linkedin\"],\n",
" location='san francisco',\n",
" search_term=\"engineer\",\n",
" results_wanted=5,\n",
"\n",
" # use if you want to use a proxy\n",
" # proxy=\"socks5://jobspy:5a4vpWtj4EeJ2hoYzk@us.smartproxy.com:10001\",\n",
" proxy=\"http://jobspy:5a4vpWtj4EeJ2hoYzk@us.smartproxy.com:10001\",\n",
" #proxy=\"https://jobspy:5a4vpWtj4EeJ2hoYzk@us.smartproxy.com:10001\",\n",
")\n",
"display(jobs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a581b2d-f7da-4fac-868d-9efe143ee20a",
"metadata": {},
"outputs": [],
"source": [
"# example 2 - remote USA & hyperlinks\n",
"jobs = scrape_jobs(\n",
" site_name=[\"linkedin\", \"zip_recruiter\", \"indeed\"],\n",
" # location='san francisco',\n",
" search_term=\"software engineer\",\n",
" country_indeed=\"USA\",\n",
" hyperlinks=True,\n",
" is_remote=True,\n",
" results_wanted=5, \n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe8289bc-5b64-4202-9a64-7c117c83fd9a",
"metadata": {},
"outputs": [],
"source": [
"# use if hyperlinks=True\n",
"html = jobs.to_html(escape=False)\n",
"# change max-width: 200px to show more or less of the content\n",
"truncate_width = f'<style>.dataframe td {{ max-width: 200px; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; }}</style>{html}'\n",
"display(HTML(truncate_width))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "951c2fe1-52ff-407d-8bb1-068049b36777",
"metadata": {},
"outputs": [],
"source": [
"# example 3 - with hyperlinks, international - linkedin (no zip_recruiter)\n",
"jobs = scrape_jobs(\n",
" site_name=[\"linkedin\"],\n",
" location='berlin',\n",
" search_term=\"engineer\",\n",
" hyperlinks=True,\n",
" results_wanted=5,\n",
" easy_apply=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e37a521-caef-441c-8fc2-2eb5b2e7da62",
"metadata": {},
"outputs": [],
"source": [
"# use if hyperlinks=True\n",
"html = jobs.to_html(escape=False)\n",
"# change max-width: 200px to show more or less of the content\n",
"truncate_width = f'<style>.dataframe td {{ max-width: 200px; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; }}</style>{html}'\n",
"display(HTML(truncate_width))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0650e608-0b58-4bf5-ae86-68348035b16a",
"metadata": {},
"outputs": [],
"source": [
"# example 4 - international indeed (no zip_recruiter)\n",
"jobs = scrape_jobs(\n",
" site_name=[\"indeed\"],\n",
" search_term=\"engineer\",\n",
" country_indeed = \"China\",\n",
" hyperlinks=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40913ac8-3f8a-4d7e-ac47-afb88316432b",
"metadata": {},
"outputs": [],
"source": [
"# use if hyperlinks=True\n",
"html = jobs.to_html(escape=False)\n",
"# change max-width: 200px to show more or less of the content\n",
"truncate_width = f'<style>.dataframe td {{ max-width: 200px; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; }}</style>{html}'\n",
"display(HTML(truncate_width))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@ -1,78 +0,0 @@
from jobspy import scrape_jobs
import pandas as pd
import os
import time
# pick a new CSV filename if jobs.csv already exists
csv_filename = "jobs.csv"
counter = 1
while os.path.exists(csv_filename):
csv_filename = f"jobs_{counter}.csv"
counter += 1
# results wanted and offset
results_wanted = 1000
offset = 0
all_jobs = []
# max retries
max_retries = 3
# number of results at each iteration
results_in_each_iteration = 30
while len(all_jobs) < results_wanted:
retry_count = 0
while retry_count < max_retries:
print("Doing from", offset, "to", offset + results_in_each_iteration, "jobs")
try:
jobs = scrape_jobs(
site_name=["indeed"],
search_term="software engineer",
# New York, NY
# Dallas, TX
# Los Angeles, CA
location="Los Angeles, CA",
results_wanted=min(
results_in_each_iteration, results_wanted - len(all_jobs)
),
country_indeed="USA",
offset=offset,
# proxy="http://jobspy:5a4vpWtj8EeJ2hoYzk@ca.smartproxy.com:20001",
)
# Add the scraped jobs to the list
all_jobs.extend(jobs.to_dict("records"))
# Increment the offset for the next page of results
offset += results_in_each_iteration
# Add a delay to avoid rate limiting (you can adjust the delay time as needed)
print(f"Scraped {len(all_jobs)} jobs")
print("Sleeping secs", 100 * (retry_count + 1))
time.sleep(100 * (retry_count + 1)) # wait 100 * (retry_count + 1) seconds between requests
break # Break out of the retry loop if successful
except Exception as e:
print(f"Error: {e}")
retry_count += 1
print("Sleeping secs before retry", 100 * (retry_count + 1))
time.sleep(100 * (retry_count + 1))
if retry_count >= max_retries:
print("Max retries reached. Exiting.")
break
# DataFrame from the collected job data
jobs_df = pd.DataFrame(all_jobs)
# Formatting
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", None)
pd.set_option("display.max_colwidth", 50)
print(jobs_df)
jobs_df.to_csv(csv_filename, index=False)
print(f"Outputted to {csv_filename}")


@ -1,27 +1,34 @@
from __future__ import annotations
import pandas as pd
from typing import Tuple
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Tuple
from .jobs import JobType, Location
from .scrapers.utils import logger, set_logger_level
from .scrapers.indeed import IndeedScraper
from .scrapers.ziprecruiter import ZipRecruiterScraper
from .scrapers.glassdoor import GlassdoorScraper
from .scrapers.linkedin import LinkedInScraper
from .scrapers import ScraperInput, Site, JobResponse, Country
from .scrapers.exceptions import (
LinkedInException,
IndeedException,
ZipRecruiterException,
GlassdoorException,
import pandas as pd
from jobspy.bayt import BaytScraper
from jobspy.glassdoor import Glassdoor
from jobspy.google import Google
from jobspy.indeed import Indeed
from jobspy.linkedin import LinkedIn
from jobspy.naukri import Naukri
from jobspy.model import JobType, Location, JobResponse, Country
from jobspy.model import SalarySource, ScraperInput, Site
from jobspy.util import (
set_logger_level,
extract_salary,
create_logger,
get_enum_from_value,
map_str_to_site,
convert_to_annual,
desired_order,
)
from jobspy.ziprecruiter import ZipRecruiter
def scrape_jobs(
site_name: str | list[str] | Site | list[Site] | None = None,
search_term: str | None = None,
google_search_term: str | None = None,
location: str | None = None,
distance: int | None = 50,
is_remote: bool = False,
@ -29,37 +36,31 @@ def scrape_jobs(
easy_apply: bool | None = None,
results_wanted: int = 15,
country_indeed: str = "usa",
hyperlinks: bool = False,
proxy: str | None = None,
proxies: list[str] | str | None = None,
ca_cert: str | None = None,
description_format: str = "markdown",
linkedin_fetch_description: bool | None = False,
linkedin_company_ids: list[int] | None = None,
offset: int | None = 0,
hours_old: int = None,
verbose: int = 2,
enforce_annual_salary: bool = False,
verbose: int = 0,
**kwargs,
) -> pd.DataFrame:
"""
Simultaneously scrapes job data from multiple job sites.
:return: pandas dataframe containing job data
Scrapes job data from job boards concurrently
:return: Pandas DataFrame containing job data
"""
SCRAPER_MAPPING = {
Site.LINKEDIN: LinkedInScraper,
Site.INDEED: IndeedScraper,
Site.ZIP_RECRUITER: ZipRecruiterScraper,
Site.GLASSDOOR: GlassdoorScraper,
Site.LINKEDIN: LinkedIn,
Site.INDEED: Indeed,
Site.ZIP_RECRUITER: ZipRecruiter,
Site.GLASSDOOR: Glassdoor,
Site.GOOGLE: Google,
Site.BAYT: BaytScraper,
Site.NAUKRI: Naukri,
}
set_logger_level(verbose)
def map_str_to_site(site_name: str) -> Site:
return Site[site_name.upper()]
def get_enum_from_value(value_str):
for job_type in JobType:
if value_str in job_type.value:
return job_type
raise Exception(f"Invalid job type: {value_str}")
job_type = get_enum_from_value(job_type) if job_type else None
def get_site_type():
@ -81,6 +82,7 @@ def scrape_jobs(
site_type=get_site_type(),
country=country_enum,
search_term=search_term,
google_search_term=google_search_term,
location=location,
distance=distance,
is_remote=is_remote,
@ -96,11 +98,11 @@ def scrape_jobs(
def scrape_site(site: Site) -> Tuple[str, JobResponse]:
scraper_class = SCRAPER_MAPPING[site]
scraper = scraper_class(proxy=proxy)
scraper = scraper_class(proxies=proxies, ca_cert=ca_cert)
scraped_data: JobResponse = scraper.scrape(scraper_input)
cap_name = site.value.capitalize()
site_name = "ZipRecruiter" if cap_name == "Zip_recruiter" else cap_name
logger.info(f"{site_name} finished scraping")
create_logger(site_name).info(f"finished scraping")
return site.value, scraped_data
site_to_jobs_dict = {}
@ -124,7 +126,6 @@ def scrape_jobs(
for job in job_response.jobs:
job_data = job.dict()
job_url = job_data["job_url"]
job_data["job_url_hyper"] = f'<a href="{job_url}">{job_url}</a>'
job_data["site"] = site
job_data["company"] = job_data["company_name"]
job_data["job_type"] = (
@ -140,6 +141,7 @@ def scrape_jobs(
**job_data["location"]
).display_location()
# Handle compensation
compensation_obj = job_data.get("compensation")
if compensation_obj and isinstance(compensation_obj, dict):
job_data["interval"] = (
@ -150,11 +152,42 @@ def scrape_jobs(
job_data["min_amount"] = compensation_obj.get("min_amount")
job_data["max_amount"] = compensation_obj.get("max_amount")
job_data["currency"] = compensation_obj.get("currency", "USD")
job_data["salary_source"] = SalarySource.DIRECT_DATA.value
if enforce_annual_salary and (
job_data["interval"]
and job_data["interval"] != "yearly"
and job_data["min_amount"]
and job_data["max_amount"]
):
convert_to_annual(job_data)
else:
job_data["interval"] = None
job_data["min_amount"] = None
job_data["max_amount"] = None
job_data["currency"] = None
if country_enum == Country.USA:
(
job_data["interval"],
job_data["min_amount"],
job_data["max_amount"],
job_data["currency"],
) = extract_salary(
job_data["description"],
enforce_annual_salary=enforce_annual_salary,
)
job_data["salary_source"] = SalarySource.DESCRIPTION.value
job_data["salary_source"] = (
job_data["salary_source"]
if "min_amount" in job_data and job_data["min_amount"]
else None
)
#naukri-specific fields
job_data["skills"] = (
", ".join(job_data["skills"]) if job_data["skills"] else None
)
job_data["experience_range"] = job_data.get("experience_range")
job_data["company_rating"] = job_data.get("company_rating")
job_data["company_reviews_count"] = job_data.get("company_reviews_count")
job_data["vacancy_count"] = job_data.get("vacancy_count")
job_data["work_from_home_type"] = job_data.get("work_from_home_type")
job_df = pd.DataFrame([job_data])
jobs_dfs.append(job_df)
@ -166,37 +199,6 @@ def scrape_jobs(
# Step 2: Concatenate the filtered DataFrames
jobs_df = pd.concat(filtered_dfs, ignore_index=True)
# Desired column order
desired_order = [
"id",
"site",
"job_url_hyper" if hyperlinks else "job_url",
"job_url_direct",
"title",
"company",
"location",
"job_type",
"date_posted",
"interval",
"min_amount",
"max_amount",
"currency",
"is_remote",
"emails",
"description",
"company_url",
"company_url_direct",
"company_addresses",
"company_industry",
"company_num_employees",
"company_revenue",
"company_description",
"logo_photo_url",
"banner_photo_url",
"ceo_name",
"ceo_photo_url",
]
# Step 3: Ensure all desired columns are present, adding missing ones as empty
for column in desired_order:
if column not in jobs_df.columns:
@ -206,6 +208,8 @@ def scrape_jobs(
jobs_df = jobs_df[desired_order]
# Step 4: Sort the DataFrame as required
return jobs_df.sort_values(by=["site", "date_posted"], ascending=[True, False])
return jobs_df.sort_values(
by=["site", "date_posted"], ascending=[True, False]
).reset_index(drop=True)
else:
return pd.DataFrame()
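
The new `enforce_annual_salary` path calls `convert_to_annual(job_data)`, whose body lives in `jobspy.util` and is not part of this hunk. As a rough sketch of what such a conversion might look like, the function below scales the pay range in place. The multipliers (2080 working hours, 260 working days, 52 weeks, 12 months per year) and the name `convert_to_annual_sketch` are assumptions, not the library's actual values.

```python
# Assumed multipliers for illustration; the real values in jobspy.util may differ.
ANNUAL_MULTIPLIER = {"hourly": 2080, "daily": 260, "weekly": 52, "monthly": 12}


def convert_to_annual_sketch(job_data: dict) -> None:
    """Scale min/max amounts to a yearly figure in place (illustrative only)."""
    factor = ANNUAL_MULTIPLIER.get(job_data.get("interval"))
    if factor is None:
        # Already yearly, or an unknown interval: leave the record untouched.
        return
    job_data["min_amount"] *= factor
    job_data["max_amount"] *= factor
    job_data["interval"] = "yearly"


job = {"interval": "hourly", "min_amount": 65, "max_amount": 75}
convert_to_annual_sketch(job)
print(job)
```

The caller in the diff only invokes the conversion when an interval other than "yearly" and both amounts are present, which is why the sketch does not guard against missing amounts itself.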

jobspy/bayt/__init__.py

@ -0,0 +1,145 @@
from __future__ import annotations
import random
import time
from bs4 import BeautifulSoup
from jobspy.model import (
Scraper,
ScraperInput,
Site,
JobPost,
JobResponse,
Location,
Country,
)
from jobspy.util import create_logger, create_session
log = create_logger("Bayt")
class BaytScraper(Scraper):
base_url = "https://www.bayt.com"
delay = 2
band_delay = 3
def __init__(
self, proxies: list[str] | str | None = None, ca_cert: str | None = None
):
super().__init__(Site.BAYT, proxies=proxies, ca_cert=ca_cert)
self.scraper_input = None
self.session = None
self.country = "worldwide"
def scrape(self, scraper_input: ScraperInput) -> JobResponse:
self.scraper_input = scraper_input
self.session = create_session(
proxies=self.proxies, ca_cert=self.ca_cert, is_tls=False, has_retry=True
)
job_list: list[JobPost] = []
page = 1
results_wanted = (
scraper_input.results_wanted if scraper_input.results_wanted else 10
)
while len(job_list) < results_wanted:
log.info(f"Fetching Bayt jobs page {page}")
job_elements = self._fetch_jobs(self.scraper_input.search_term, page)
if not job_elements:
break
if job_elements:
log.debug(
"First job element snippet:\n" + job_elements[0].prettify()[:500]
)
initial_count = len(job_list)
for job in job_elements:
try:
job_post = self._extract_job_info(job)
if job_post:
job_list.append(job_post)
if len(job_list) >= results_wanted:
break
else:
log.debug(
"Extraction returned None. Job snippet:\n"
+ job.prettify()[:500]
)
except Exception as e:
log.error(f"Bayt: Error extracting job info: {str(e)}")
continue
if len(job_list) == initial_count:
log.info(f"No new jobs found on page {page}. Ending pagination.")
break
page += 1
time.sleep(random.uniform(self.delay, self.delay + self.band_delay))
# Slice with the local default-adjusted value so a falsy results_wanted does not empty the list
job_list = job_list[:results_wanted]
return JobResponse(jobs=job_list)
def _fetch_jobs(self, query: str, page: int) -> list | None:
"""
Grabs the job results for the given query and page number.
"""
try:
url = f"{self.base_url}/en/international/jobs/{query}-jobs/?page={page}"
response = self.session.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
job_listings = soup.find_all("li", attrs={"data-js-job": ""})
log.debug(f"Found {len(job_listings)} job listing elements")
return job_listings
except Exception as e:
log.error(f"Bayt: Error fetching jobs - {str(e)}")
return None
def _extract_job_info(self, job: BeautifulSoup) -> JobPost | None:
"""
Extracts the job information from a single job listing.
"""
# Find the h2 element holding the title and link (no class filtering)
job_general_information = job.find("h2")
if not job_general_information:
return
job_title = job_general_information.get_text(strip=True)
job_url = self._extract_job_url(job_general_information)
if not job_url:
return
# Extract company name using the original approach:
company_tag = job.find("div", class_="t-nowrap p10l")
company_name = (
company_tag.find("span").get_text(strip=True)
if company_tag and company_tag.find("span")
else None
)
# Extract location using the original approach:
location_tag = job.find("div", class_="t-mute t-small")
location = location_tag.get_text(strip=True) if location_tag else None
job_id = f"bayt-{abs(hash(job_url))}"
location_obj = Location(
city=location,
country=Country.from_string(self.country),
)
return JobPost(
id=job_id,
title=job_title,
company_name=company_name,
location=location_obj,
job_url=job_url,
)
def _extract_job_url(self, job_general_information: BeautifulSoup) -> str | None:
"""
Pulls the job URL from the 'a' within the h2 element.
"""
a_tag = job_general_information.find("a")
if a_tag and a_tag.has_attr("href"):
return self.base_url + a_tag["href"].strip()
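
Note on the `job_id` built in `_extract_job_info`: Python's built-in `hash()` is salted per process for strings unless `PYTHONHASHSEED` is fixed, so the same Bayt URL can produce a different ID on the next run. A minimal sketch of a deterministic alternative follows. `stable_job_id` is a hypothetical helper that is not part of JobSpy, and the 16-character truncation is an arbitrary choice.

```python
import hashlib


def stable_job_id(job_url: str, prefix: str = "bayt") -> str:
    """Hypothetical helper: derive an ID that is identical across runs."""
    # SHA-1 is used only as a fingerprint here, not for security.
    digest = hashlib.sha1(job_url.encode("utf-8")).hexdigest()
    # Truncation length is an assumption; shorter IDs collide more often.
    return f"{prefix}-{digest[:16]}"


if __name__ == "__main__":
    print(stable_job_id("https://www.bayt.com/en/international/jobs/example"))
```

If IDs are only used within a single run, the existing `hash()` approach is sufficient; the digest matters when results from several runs are merged or de-duplicated.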


@ -1,5 +1,5 @@
"""
jobspy.scrapers.exceptions
jobspy.jobboard.exceptions
~~~~~~~~~~~~~~~~~~~
This module contains the set of Scrapers' exceptions.
@ -24,3 +24,17 @@ class ZipRecruiterException(Exception):
class GlassdoorException(Exception):
def __init__(self, message=None):
super().__init__(message or "An error occurred with Glassdoor")
class GoogleJobsException(Exception):
def __init__(self, message=None):
super().__init__(message or "An error occurred with Google Jobs")
class BaytException(Exception):
def __init__(self, message=None):
super().__init__(message or "An error occurred with Bayt")
class NaukriException(Exception):
def __init__(self, message=None):
super().__init__(message or "An error occurred with Naukri")


@ -0,0 +1,320 @@
from __future__ import annotations
import re
import json
import requests
from typing import Tuple
from datetime import datetime, timedelta
from concurrent.futures import ThreadPoolExecutor, as_completed
from jobspy.glassdoor.constant import fallback_token, query_template, headers
from jobspy.glassdoor.util import (
get_cursor_for_page,
parse_compensation,
parse_location,
)
from jobspy.util import (
extract_emails_from_text,
create_logger,
create_session,
markdown_converter,
)
from jobspy.exception import GlassdoorException
from jobspy.model import (
JobPost,
JobResponse,
DescriptionFormat,
Scraper,
ScraperInput,
Site,
)
log = create_logger("Glassdoor")
class Glassdoor(Scraper):
def __init__(
self, proxies: list[str] | str | None = None, ca_cert: str | None = None
):
"""
Initializes the Glassdoor scraper (the base URL is resolved per country in scrape())
"""
site = Site(Site.GLASSDOOR)
super().__init__(site, proxies=proxies, ca_cert=ca_cert)
self.base_url = None
self.country = None
self.session = None
self.scraper_input = None
self.jobs_per_page = 30
self.max_pages = 30
self.seen_urls = set()
def scrape(self, scraper_input: ScraperInput) -> JobResponse:
"""
Scrapes Glassdoor for jobs with scraper_input criteria.
:param scraper_input: Information about job search criteria.
:return: JobResponse containing a list of jobs.
"""
self.scraper_input = scraper_input
self.scraper_input.results_wanted = min(900, scraper_input.results_wanted)
self.base_url = self.scraper_input.country.get_glassdoor_url()
self.session = create_session(
proxies=self.proxies, ca_cert=self.ca_cert, has_retry=True
)
token = self._get_csrf_token()
headers["gd-csrf-token"] = token if token else fallback_token
self.session.headers.update(headers)
location_id, location_type = self._get_location(
scraper_input.location, scraper_input.is_remote
)
if location_type is None:
log.error("Glassdoor: location not parsed")
return JobResponse(jobs=[])
job_list: list[JobPost] = []
cursor = None
range_start = 1 + (scraper_input.offset // self.jobs_per_page)
tot_pages = (scraper_input.results_wanted // self.jobs_per_page) + 2
range_end = min(tot_pages, self.max_pages + 1)
for page in range(range_start, range_end):
log.info(f"search page: {page} / {range_end - 1}")
try:
jobs, cursor = self._fetch_jobs_page(
scraper_input, location_id, location_type, page, cursor
)
job_list.extend(jobs)
if not jobs or len(job_list) >= scraper_input.results_wanted:
job_list = job_list[: scraper_input.results_wanted]
break
except Exception as e:
log.error(f"Glassdoor: {str(e)}")
break
return JobResponse(jobs=job_list)
def _fetch_jobs_page(
self,
scraper_input: ScraperInput,
location_id: int,
location_type: str,
page_num: int,
cursor: str | None,
) -> Tuple[list[JobPost], str | None]:
"""
Scrapes a page of Glassdoor for jobs with scraper_input criteria
"""
jobs = []
self.scraper_input = scraper_input
try:
payload = self._add_payload(location_id, location_type, page_num, cursor)
response = self.session.post(
f"{self.base_url}/graph",
timeout_seconds=15,
data=payload,
)
if response.status_code != 200:
exc_msg = f"bad response status code: {response.status_code}"
raise GlassdoorException(exc_msg)
res_json = response.json()[0]
if "errors" in res_json:
raise ValueError("Error encountered in API response")
# Exception already covers ReadTimeout, GlassdoorException and ValueError
except Exception as e:
log.error(f"Glassdoor: {str(e)}")
return jobs, None
jobs_data = res_json["data"]["jobListings"]["jobListings"]
with ThreadPoolExecutor(max_workers=self.jobs_per_page) as executor:
future_to_job_data = {
executor.submit(self._process_job, job): job for job in jobs_data
}
for future in as_completed(future_to_job_data):
try:
job_post = future.result()
if job_post:
jobs.append(job_post)
except Exception as exc:
raise GlassdoorException(f"Glassdoor generated an exception: {exc}")
return jobs, get_cursor_for_page(
res_json["data"]["jobListings"]["paginationCursors"], page_num + 1
)
def _get_csrf_token(self):
"""
Fetches csrf token needed for API by visiting a generic page
"""
res = self.session.get(f"{self.base_url}/Job/computer-science-jobs.htm")
pattern = r'"token":\s*"([^"]+)"'
matches = re.findall(pattern, res.text)
token = None
if matches:
token = matches[0]
return token
def _process_job(self, job_data):
"""
Processes a single job and fetches its description.
"""
job_id = job_data["jobview"]["job"]["listingId"]
job_url = f"{self.base_url}job-listing/j?jl={job_id}"
if job_url in self.seen_urls:
return None
self.seen_urls.add(job_url)
job = job_data["jobview"]
title = job["job"]["jobTitleText"]
company_name = job["header"]["employerNameFromSearch"]
company_id = job_data["jobview"]["header"]["employer"]["id"]
location_name = job["header"].get("locationName", "")
location_type = job["header"].get("locationType", "")
age_in_days = job["header"].get("ageInDays")
is_remote, location = False, None
# Guard before building the timedelta: timedelta(days=None) raises TypeError
date_posted = (
(datetime.now() - timedelta(days=age_in_days)).date()
if age_in_days is not None
else None
)
if location_type == "S":
is_remote = True
else:
location = parse_location(location_name)
compensation = parse_compensation(job["header"])
try:
description = self._fetch_job_description(job_id)
except Exception:
description = None
company_url = f"{self.base_url}Overview/W-EI_IE{company_id}.htm"
company_logo = (
job_data["jobview"].get("overview", {}).get("squareLogoUrl", None)
)
listing_type = (
job_data["jobview"]
.get("header", {})
.get("adOrderSponsorshipLevel", "")
.lower()
)
return JobPost(
id=f"gd-{job_id}",
title=title,
company_url=company_url if company_id else None,
company_name=company_name,
date_posted=date_posted,
job_url=job_url,
location=location,
compensation=compensation,
is_remote=is_remote,
description=description,
emails=extract_emails_from_text(description) if description else None,
company_logo=company_logo,
listing_type=listing_type,
)
def _fetch_job_description(self, job_id):
"""
Fetches the job description for a single job ID.
"""
url = f"{self.base_url}/graph"
body = [
{
"operationName": "JobDetailQuery",
"variables": {
"jl": job_id,
"queryString": "q",
"pageTypeEnum": "SERP",
},
"query": """
query JobDetailQuery($jl: Long!, $queryString: String, $pageTypeEnum: PageTypeEnum) {
jobview: jobView(
listingId: $jl
contextHolder: {queryString: $queryString, pageTypeEnum: $pageTypeEnum}
) {
job {
description
__typename
}
__typename
}
}
""",
}
]
res = requests.post(url, json=body, headers=headers)
if res.status_code != 200:
return None
data = res.json()[0]
desc = data["data"]["jobview"]["job"]["description"]
if self.scraper_input.description_format == DescriptionFormat.MARKDOWN:
desc = markdown_converter(desc)
return desc
def _get_location(
self, location: str, is_remote: bool
) -> tuple[int | str | None, str | None]:
if not location or is_remote:
return "11047", "STATE" # remote options
url = f"{self.base_url}/findPopularLocationAjax.htm?maxLocationsToReturn=10&term={location}"
res = self.session.get(url)
if res.status_code != 200:
if res.status_code == 429:
err = "429 Response - Blocked by Glassdoor for too many requests"
log.error(err)
return None, None
else:
err = f"Glassdoor response status code {res.status_code}"
err += f" - {res.text}"
log.error(err)
return None, None
items = res.json()
if not items:
raise ValueError(f"Location '{location}' not found on Glassdoor")
location_type = items[0]["locationType"]
if location_type == "C":
location_type = "CITY"
elif location_type == "S":
location_type = "STATE"
elif location_type == "N":
location_type = "COUNTRY"
return int(items[0]["locationId"]), location_type
def _add_payload(
self,
location_id: int,
location_type: str,
page_num: int,
cursor: str | None = None,
) -> str:
fromage = None
if self.scraper_input.hours_old:
fromage = max(self.scraper_input.hours_old // 24, 1)
filter_params = []
if self.scraper_input.easy_apply:
filter_params.append({"filterKey": "applicationType", "values": "1"})
if fromage:
filter_params.append({"filterKey": "fromAge", "values": str(fromage)})
payload = {
"operationName": "JobSearchResultsQuery",
"variables": {
"excludeJobListingIds": [],
"filterParams": filter_params,
"keyword": self.scraper_input.search_term,
"numJobsToShow": 30,
"locationType": location_type,
"locationId": int(location_id),
"parameterUrlInput": f"IL.0,12_I{location_type}{location_id}",
"pageNumber": page_num,
"pageCursor": cursor,
"fromage": fromage,
"sort": "date",
},
"query": query_template,
}
if self.scraper_input.job_type:
payload["variables"]["filterParams"].append(
{"filterKey": "jobType", "values": self.scraper_input.job_type.value[0]}
)
return json.dumps([payload])
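
The page arithmetic in `Glassdoor.scrape()` is easier to check in isolation. The sketch below copies the three formulas from the diff into a hypothetical standalone function, `glassdoor_page_range`, which does not exist in JobSpy. The defaults of 30 mirror `jobs_per_page` and `max_pages` from `__init__`.

```python
def glassdoor_page_range(
    results_wanted: int,
    offset: int = 0,
    jobs_per_page: int = 30,
    max_pages: int = 30,
) -> range:
    """Hypothetical helper mirroring the range computed in scrape()."""
    # First page to request, skipping whole pages covered by the offset.
    range_start = 1 + (offset // jobs_per_page)
    # One extra page of headroom, plus one because range() excludes its end.
    tot_pages = (results_wanted // jobs_per_page) + 2
    # Never go past the hard page cap.
    range_end = min(tot_pages, max_pages + 1)
    return range(range_start, range_end)


if __name__ == "__main__":
    print(list(glassdoor_page_range(100)))             # pages requested for 100 results
    print(len(glassdoor_page_range(900)))              # page count at the 900-result cap
    print(list(glassdoor_page_range(10, offset=60)))   # small request with a large offset
```

Under these formulas, a small `results_wanted` combined with an offset of two or more pages yields an empty range, because `tot_pages` does not take the offset into account. Whether that matters depends on how callers use `offset`.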


@ -0,0 +1,184 @@
headers = {
"authority": "www.glassdoor.com",
"accept": "*/*",
"accept-language": "en-US,en;q=0.9",
"apollographql-client-name": "job-search-next",
"apollographql-client-version": "4.65.5",
"content-type": "application/json",
"origin": "https://www.glassdoor.com",
"referer": "https://www.glassdoor.com/",
"sec-ch-ua": '"Chromium";v="118", "Google Chrome";v="118", "Not=A?Brand";v="99"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"macOS"',
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
}
query_template = """
query JobSearchResultsQuery(
$excludeJobListingIds: [Long!],
$keyword: String,
$locationId: Int,
$locationType: LocationTypeEnum,
$numJobsToShow: Int!,
$pageCursor: String,
$pageNumber: Int,
$filterParams: [FilterParams],
$originalPageUrl: String,
$seoFriendlyUrlInput: String,
$parameterUrlInput: String,
$seoUrl: Boolean
) {
jobListings(
contextHolder: {
searchParams: {
excludeJobListingIds: $excludeJobListingIds,
keyword: $keyword,
locationId: $locationId,
locationType: $locationType,
numPerPage: $numJobsToShow,
pageCursor: $pageCursor,
pageNumber: $pageNumber,
filterParams: $filterParams,
originalPageUrl: $originalPageUrl,
seoFriendlyUrlInput: $seoFriendlyUrlInput,
parameterUrlInput: $parameterUrlInput,
seoUrl: $seoUrl,
searchType: SR
}
}
) {
companyFilterOptions {
id
shortName
__typename
}
filterOptions
indeedCtk
jobListings {
...JobView
__typename
}
jobListingSeoLinks {
linkItems {
position
url
__typename
}
__typename
}
jobSearchTrackingKey
jobsPageSeoData {
pageMetaDescription
pageTitle
__typename
}
paginationCursors {
cursor
pageNumber
__typename
}
indexablePageForSeo
searchResultsMetadata {
searchCriteria {
implicitLocation {
id
localizedDisplayName
type
__typename
}
keyword
location {
id
shortName
localizedShortName
localizedDisplayName
type
__typename
}
__typename
}
helpCenterDomain
helpCenterLocale
jobSerpJobOutlook {
occupation
paragraph
__typename
}
showMachineReadableJobs
__typename
}
totalJobsCount
__typename
}
}
fragment JobView on JobListingSearchResult {
jobview {
header {
adOrderId
advertiserType
adOrderSponsorshipLevel
ageInDays
divisionEmployerName
easyApply
employer {
id
name
shortName
__typename
}
employerNameFromSearch
goc
gocConfidence
gocId
jobCountryId
jobLink
jobResultTrackingKey
jobTitleText
locationName
locationType
locId
needsCommission
payCurrency
payPeriod
payPeriodAdjustedPay {
p10
p50
p90
__typename
}
rating
salarySource
savedJobId
sponsored
__typename
}
job {
description
importConfigId
jobTitleId
jobTitleText
listingId
__typename
}
jobListingAdminDetails {
cpcVal
importConfigId
jobListingId
jobSourceId
userEligibleForAdminJobDetails
__typename
}
overview {
shortName
squareLogoUrl
__typename
}
__typename
}
__typename
}
"""
fallback_token = "Ft6oHEWlRZrxDww95Cpazw:0pGUrkb2y3TyOpAIqF2vbPmUXoXVkD3oEGDVkvfeCerceQ5-n8mBg3BovySUIjmCPHCaW0H2nQVdqzbtsYqf4Q:wcqRqeegRUa9MVLJGyujVXB7vWFPjdaS1CtrrzJq-ok"

42
jobspy/glassdoor/util.py Normal file

@ -0,0 +1,42 @@
from jobspy.model import Compensation, CompensationInterval, Location, JobType
def parse_compensation(data: dict) -> Compensation | None:
pay_period = data.get("payPeriod")
adjusted_pay = data.get("payPeriodAdjustedPay")
currency = data.get("payCurrency", "USD")
if not pay_period or not adjusted_pay:
return None
interval = None
if pay_period == "ANNUAL":
interval = CompensationInterval.YEARLY
elif pay_period:
interval = CompensationInterval.get_interval(pay_period)
min_amount = int(adjusted_pay.get("p10") // 1)
max_amount = int(adjusted_pay.get("p90") // 1)
return Compensation(
interval=interval,
min_amount=min_amount,
max_amount=max_amount,
currency=currency,
)
def get_job_type_enum(job_type_str: str) -> list[JobType] | None:
for job_type in JobType:
if job_type_str in job_type.value:
return [job_type]
def parse_location(location_name: str) -> Location | None:
if not location_name or location_name == "Remote":
return
city, _, state = location_name.partition(", ")
return Location(city=city, state=state)
def get_cursor_for_page(pagination_cursors, page_num):
for cursor_data in pagination_cursors:
if cursor_data["pageNumber"] == page_num:
return cursor_data["cursor"]
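
`parse_location` relies on `str.partition`, which splits on the first occurrence of the separator only. The sketch below reproduces that behaviour with a hypothetical `Location` dataclass standing in for `jobspy.model.Location`, so it can run without the package installed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Location:
    """Hypothetical stand-in for jobspy.model.Location (assumed fields)."""
    city: Optional[str] = None
    state: Optional[str] = None


def parse_location(location_name: str) -> Optional[Location]:
    # Empty names and the literal "Remote" carry no location.
    if not location_name or location_name == "Remote":
        return None
    # partition() splits once: everything after the first ", " becomes state.
    city, _, state = location_name.partition(", ")
    return Location(city=city, state=state)


if __name__ == "__main__":
    for name in ["Austin, TX", "Berlin", "San Jose, CA, US", "Remote"]:
        print(repr(name), "->", parse_location(name))
```

Two consequences follow from `partition`: a name without a comma produces an empty-string state rather than `None`, and a three-part name keeps the remainder ("CA, US") in `state`.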

202
jobspy/google/__init__.py Normal file

@ -0,0 +1,202 @@
from __future__ import annotations
import math
import re
import json
from typing import Tuple
from datetime import datetime, timedelta
from jobspy.google.constant import headers_jobs, headers_initial, async_param
from jobspy.model import (
Scraper,
ScraperInput,
Site,
JobPost,
JobResponse,
Location,
JobType,
)
from jobspy.util import extract_emails_from_text, extract_job_type, create_session
from jobspy.google.util import log, find_job_info_initial_page, find_job_info
class Google(Scraper):
def __init__(
self, proxies: list[str] | str | None = None, ca_cert: str | None = None
):
"""
Initializes Google Scraper with the Google jobs search url
"""
site = Site(Site.GOOGLE)
super().__init__(site, proxies=proxies, ca_cert=ca_cert)
self.country = None
self.session = None
self.scraper_input = None
self.jobs_per_page = 10
self.seen_urls = set()
self.url = "https://www.google.com/search"
self.jobs_url = "https://www.google.com/async/callback:550"
def scrape(self, scraper_input: ScraperInput) -> JobResponse:
"""
Scrapes Google for jobs with scraper_input criteria.
:param scraper_input: Information about job search criteria.
:return: JobResponse containing a list of jobs.
"""
self.scraper_input = scraper_input
self.scraper_input.results_wanted = min(900, scraper_input.results_wanted)
self.session = create_session(
proxies=self.proxies, ca_cert=self.ca_cert, is_tls=False, has_retry=True
)
forward_cursor, job_list = self._get_initial_cursor_and_jobs()
if forward_cursor is None:
log.warning(
"initial cursor not found; try changing your query, or there were at most 10 results"
)
return JobResponse(jobs=job_list)
page = 1
while (
len(self.seen_urls) < scraper_input.results_wanted + scraper_input.offset
and forward_cursor
):
log.info(
f"search page: {page} / {math.ceil(scraper_input.results_wanted / self.jobs_per_page)}"
)
try:
jobs, forward_cursor = self._get_jobs_next_page(forward_cursor)
except Exception as e:
log.error(f"failed to get jobs on page: {page}, {e}")
break
if not jobs:
log.info(f"found no jobs on page: {page}")
break
job_list += jobs
page += 1
return JobResponse(
jobs=job_list[
scraper_input.offset : scraper_input.offset
+ scraper_input.results_wanted
]
)
def _get_initial_cursor_and_jobs(self) -> Tuple[str, list[JobPost]]:
"""Gets initial cursor and jobs to paginate through job listings"""
query = f"{self.scraper_input.search_term} jobs"
def get_time_range(hours_old):
if hours_old <= 24:
return "since yesterday"
elif hours_old <= 72:
return "in the last 3 days"
elif hours_old <= 168:
return "in the last week"
else:
return "in the last month"
job_type_mapping = {
JobType.FULL_TIME: "Full time",
JobType.PART_TIME: "Part time",
JobType.INTERNSHIP: "Internship",
JobType.CONTRACT: "Contract",
}
if self.scraper_input.job_type in job_type_mapping:
query += f" {job_type_mapping[self.scraper_input.job_type]}"
if self.scraper_input.location:
query += f" near {self.scraper_input.location}"
if self.scraper_input.hours_old:
time_filter = get_time_range(self.scraper_input.hours_old)
query += f" {time_filter}"
if self.scraper_input.is_remote:
query += " remote"
if self.scraper_input.google_search_term:
query = self.scraper_input.google_search_term
params = {"q": query, "udm": "8"}
response = self.session.get(self.url, headers=headers_initial, params=params)
pattern_fc = r'<div jsname="Yust4d"[^>]+data-async-fc="([^"]+)"'
match_fc = re.search(pattern_fc, response.text)
data_async_fc = match_fc.group(1) if match_fc else None
jobs_raw = find_job_info_initial_page(response.text)
jobs = []
for job_raw in jobs_raw:
job_post = self._parse_job(job_raw)
if job_post:
jobs.append(job_post)
return data_async_fc, jobs
def _get_jobs_next_page(self, forward_cursor: str) -> Tuple[list[JobPost], str]:
params = {"fc": [forward_cursor], "fcv": ["3"], "async": [async_param]}
response = self.session.get(self.jobs_url, headers=headers_jobs, params=params)
return self._parse_jobs(response.text)
def _parse_jobs(self, job_data: str) -> Tuple[list[JobPost], str]:
"""
Parses jobs on a page with next page cursor
"""
start_idx = job_data.find("[[[")
end_idx = job_data.rindex("]]]") + 3
s = job_data[start_idx:end_idx]
parsed = json.loads(s)[0]
pattern_fc = r'data-async-fc="([^"]+)"'
match_fc = re.search(pattern_fc, job_data)
data_async_fc = match_fc.group(1) if match_fc else None
jobs_on_page = []
for array in parsed:
_, job_data = array
if not job_data.startswith("[[["):
continue
job_d = json.loads(job_data)
job_info = find_job_info(job_d)
job_post = self._parse_job(job_info)
if job_post:
jobs_on_page.append(job_post)
return jobs_on_page, data_async_fc
def _parse_job(self, job_info: list):
job_url = job_info[3][0][0] if job_info[3] and job_info[3][0] else None
if job_url in self.seen_urls:
return
self.seen_urls.add(job_url)
title = job_info[0]
company_name = job_info[1]
location = city = job_info[2]
state = country = date_posted = None
if location and "," in location:
city, state, *country = [*map(lambda x: x.strip(), location.split(","))]
days_ago_str = job_info[12]
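if isinstance(days_ago_str, str):
match = re.search(r"\d+", days_ago_str)
days_ago = int(match.group()) if match else None
# Labels without a number leave date_posted as None instead of raising TypeError
date_posted = (
(datetime.now() - timedelta(days=days_ago)).date()
if days_ago is not None
else None
)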
description = job_info[19]
job_post = JobPost(
id=f"go-{job_info[28]}",
title=title,
company_name=company_name,
location=Location(
city=city, state=state, country=country[0] if country else None
),
job_url=job_url,
date_posted=date_posted,
is_remote="remote" in description.lower() or "wfh" in description.lower(),
description=description,
emails=extract_emails_from_text(description),
job_type=extract_job_type(description),
)
return job_post
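
The date handling in `_parse_job` turns a relative label into a calendar date by subtracting a `timedelta`. A small sketch of that step, written so it can be tested with a fixed "today", is shown here. `date_from_label` is a hypothetical name, and the "N days ago" label shape is an assumption about what Google returns.

```python
import re
from datetime import date, timedelta
from typing import Optional


def date_from_label(label: object, today: date) -> Optional[date]:
    """Hypothetical helper: convert a label such as '3 days ago' to a date."""
    # Non-string values (for example None) carry no date information.
    if not isinstance(label, str):
        return None
    match = re.search(r"\d+", label)
    # Labels without a number cannot be converted.
    if match is None:
        return None
    # timedelta handles month and year boundaries, unlike day arithmetic.
    return today - timedelta(days=int(match.group()))


if __name__ == "__main__":
    today = date(2025, 3, 27)
    print(date_from_label("3 days ago", today))
    print(date_from_label("30 days ago", today))
    print(date_from_label("just posted", today))
```

Passing `today` explicitly keeps the function deterministic; the scraper itself uses `datetime.now()`.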

52
jobspy/google/constant.py Normal file

@ -0,0 +1,52 @@
headers_initial = {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"accept-language": "en-US,en;q=0.9",
"priority": "u=0, i",
"referer": "https://www.google.com/",
"sec-ch-prefers-color-scheme": "dark",
"sec-ch-ua": '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
"sec-ch-ua-arch": '"arm"',
"sec-ch-ua-bitness": '"64"',
"sec-ch-ua-form-factors": '"Desktop"',
"sec-ch-ua-full-version": '"130.0.6723.58"',
"sec-ch-ua-full-version-list": '"Chromium";v="130.0.6723.58", "Google Chrome";v="130.0.6723.58", "Not?A_Brand";v="99.0.0.0"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-model": '""',
"sec-ch-ua-platform": '"macOS"',
"sec-ch-ua-platform-version": '"15.0.1"',
"sec-ch-ua-wow64": "?0",
"sec-fetch-dest": "document",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "same-origin",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
"x-browser-channel": "stable",
"x-browser-copyright": "Copyright 2024 Google LLC. All rights reserved.",
"x-browser-year": "2024",
}
headers_jobs = {
"accept": "*/*",
"accept-language": "en-US,en;q=0.9",
"priority": "u=1, i",
"referer": "https://www.google.com/",
"sec-ch-prefers-color-scheme": "dark",
"sec-ch-ua": '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
"sec-ch-ua-arch": '"arm"',
"sec-ch-ua-bitness": '"64"',
"sec-ch-ua-form-factors": '"Desktop"',
"sec-ch-ua-full-version": '"130.0.6723.58"',
"sec-ch-ua-full-version-list": '"Chromium";v="130.0.6723.58", "Google Chrome";v="130.0.6723.58", "Not?A_Brand";v="99.0.0.0"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-model": '""',
"sec-ch-ua-platform": '"macOS"',
"sec-ch-ua-platform-version": '"15.0.1"',
"sec-ch-ua-wow64": "?0",
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
}
async_param = "_basejs:/xjs/_/js/k=xjs.s.en_US.JwveA-JiKmg.2018.O/am=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAACAAAoICAAAAAAAKMAfAAAAIAQAAAAAAAAAAAAACCAAAEJDAAACAAAAAGABAIAAARBAAABAAAAAgAgQAABAASKAfv8JAAABAAAAAAwAQAQACQAAAAAAcAEAQABoCAAAABAAAIABAACAAAAEAAAAFAAAAAAAAAAAAAAAAAAAAAAAAACAQADoBwAAAAAAAAAAAAAQBAAAAATQAAoACOAHAAAAAAAAAQAAAIIAAAA_ZAACAAAAAAAAcB8APB4wHFJ4AAAAAAAAAAAAAAAACECCYA5If0EACAAAAAAAAAAAAAAAAAAAUgRNXG4AMAE/dg=0/br=1/rs=ACT90oGxMeaFMCopIHq5tuQM-6_3M_VMjQ,_basecss:/xjs/_/ss/k=xjs.s.IwsGu62EDtU.L.B1.O/am=QOoQIAQAAAQAREADEBAAAAAAAAAAAAAAAAAAAAAgAQAAIAAAgAQAAAIAIAIAoEwCAADIC8AfsgEAawwAPkAAjgoAGAAAAAAAAEADAAAAAAIgAECHAAAAAAAAAAABAQAggAARQAAAQCEAAAAAIAAAABgAAAAAIAQIACCAAfB-AAFIQABoCEA_CgEAAIABAACEgHAEwwAEFQAM4CgAAAAAAAAAAAAACABCAAAAQEAAABAgAMCPAAA4AoE2BAEAggSAAIoAQAAAAAgAAAAACCAQAAAxEwA_ZAACAAAAAAAAAAkAAAAAAAAgAAAAAAAAAAAAAAAAAAAAAAAAQAEAAAAAAAAAAAAAAAAAAAAAQA/br=1/rs=ACT90oGZc36t3uUQkj0srnIvvbHjO2hgyg,_basecomb:/xjs/_/js/k=xjs.s.en_US.JwveA-JiKmg.2018.O/ck=xjs.s.IwsGu62EDtU.L.B1.O/am=QOoQIAQAAAQAREADEBAAAAAAAAAAAAAAAAAAAAAgAQAAIAAAgAQAAAKAIAoIqEwCAADIK8AfsgEAawwAPkAAjgoAGAAACCAAAEJDAAACAAIgAGCHAIAAARBAAABBAQAggAgRQABAQSOAfv8JIAABABgAAAwAYAQICSCAAfB-cAFIQABoCEA_ChEAAIABAACEgHAEwwAEFQAM4CgAAAAAAAAAAAAACABCAACAQEDoBxAgAMCPAAA4AoE2BAEAggTQAIoASOAHAAgAAAAACSAQAIIxEwA_ZAACAAAAAAAAcB8APB4wHFJ4AAAAAAAAAAAAAAAACECCYA5If0EACAAAAAAAAAAAAAAAAAAAUgRNXG4AMAE/d=1/ed=1/dg=0/br=1/ujg=1/rs=ACT90oFNLTjPzD_OAqhhtXwe2pg1T3WpBg,_fmt:prog,_id:fc_5FwaZ86OKsfdwN4P4La3yA4_2"

41
jobspy/google/util.py Normal file

@ -0,0 +1,41 @@
import re
from jobspy.util import create_logger
log = create_logger("Google")
def find_job_info(jobs_data: list | dict) -> list | None:
"""Iterates through the JSON data to find the job listings"""
if isinstance(jobs_data, dict):
for key, value in jobs_data.items():
if key == "520084652" and isinstance(value, list):
return value
else:
result = find_job_info(value)
if result:
return result
elif isinstance(jobs_data, list):
for item in jobs_data:
result = find_job_info(item)
if result:
return result
return None
def find_job_info_initial_page(html_text: str):
pattern = r'520084652":(\[.*?\]\s*])\s*}\s*]\s*]\s*]\s*]\s*]'
results = []
matches = re.finditer(pattern, html_text)
import json
for match in matches:
try:
parsed_data = json.loads(match.group(1))
results.append(parsed_data)
except json.JSONDecodeError as e:
log.error(f"Failed to parse match: {str(e)}")
results.append({"raw_match": match.group(0), "error": str(e)})
return results
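
`find_job_info` is a depth-first search through nested dicts and lists for one specific key. The sketch below generalises it to an arbitrary key so the traversal can be exercised on a toy payload. `find_value_for_key` is a hypothetical name, and the sample data is invented for illustration only.

```python
from typing import Any, Optional


def find_value_for_key(data: Any, target_key: str) -> Optional[list]:
    """Hypothetical generalisation of find_job_info for any key."""
    if isinstance(data, dict):
        for key, value in data.items():
            # A hit only counts when the value is a list, as in the diff.
            if key == target_key and isinstance(value, list):
                return value
            result = find_value_for_key(value, target_key)
            if result:
                return result
    elif isinstance(data, list):
        for item in data:
            result = find_value_for_key(item, target_key)
            if result:
                return result
    # Scalars and exhausted containers yield nothing.
    return None


if __name__ == "__main__":
    payload = {"a": [{"b": {"520084652": ["Title", "Company"]}}], "c": 1}
    print(find_value_for_key(payload, "520084652"))
```

Because nested results are accepted only when truthy, an empty list found deeper in the structure is skipped and the search continues with the next sibling.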

260
jobspy/indeed/__init__.py Normal file

@ -0,0 +1,260 @@
from __future__ import annotations
import math
from datetime import datetime
from typing import Tuple
from jobspy.indeed.constant import job_search_query, api_headers
from jobspy.indeed.util import is_job_remote, get_compensation, get_job_type
from jobspy.model import (
Scraper,
ScraperInput,
Site,
JobPost,
Location,
JobResponse,
JobType,
DescriptionFormat,
)
from jobspy.util import (
extract_emails_from_text,
markdown_converter,
create_session,
create_logger,
)
log = create_logger("Indeed")
class Indeed(Scraper):
def __init__(
self, proxies: list[str] | str | None = None, ca_cert: str | None = None
):
"""
Initializes IndeedScraper with the Indeed API url
"""
super().__init__(Site.INDEED, proxies=proxies, ca_cert=ca_cert)
self.session = create_session(
proxies=self.proxies, ca_cert=ca_cert, is_tls=False
)
self.scraper_input = None
self.jobs_per_page = 100
self.num_workers = 10
self.seen_urls = set()
self.headers = None
self.api_country_code = None
self.base_url = None
self.api_url = "https://apis.indeed.com/graphql"
def scrape(self, scraper_input: ScraperInput) -> JobResponse:
"""
Scrapes Indeed for jobs with scraper_input criteria
:param scraper_input:
:return: job_response
"""
self.scraper_input = scraper_input
domain, self.api_country_code = self.scraper_input.country.indeed_domain_value
self.base_url = f"https://{domain}.indeed.com"
self.headers = api_headers.copy()
self.headers["indeed-co"] = self.api_country_code
job_list = []
page = 1
cursor = None
while len(self.seen_urls) < scraper_input.results_wanted + scraper_input.offset:
log.info(
f"search page: {page} / {math.ceil(scraper_input.results_wanted / self.jobs_per_page)}"
)
jobs, cursor = self._scrape_page(cursor)
if not jobs:
log.info(f"found no jobs on page: {page}")
break
job_list += jobs
page += 1
return JobResponse(
jobs=job_list[
scraper_input.offset : scraper_input.offset
+ scraper_input.results_wanted
]
)
def _scrape_page(self, cursor: str | None) -> Tuple[list[JobPost], str | None]:
"""
Scrapes a page of Indeed for jobs with scraper_input criteria
:param cursor:
:return: jobs found on page, next page cursor
"""
jobs = []
new_cursor = None
filters = self._build_filters()
search_term = (
self.scraper_input.search_term.replace('"', '\\"')
if self.scraper_input.search_term
else ""
)
query = job_search_query.format(
what=(f'what: "{search_term}"' if search_term else ""),
location=(
f'location: {{where: "{self.scraper_input.location}", radius: {self.scraper_input.distance}, radiusUnit: MILES}}'
if self.scraper_input.location
else ""
),
dateOnIndeed=self.scraper_input.hours_old,
cursor=f'cursor: "{cursor}"' if cursor else "",
filters=filters,
)
payload = {
"query": query,
}
api_headers_temp = api_headers.copy()
api_headers_temp["indeed-co"] = self.api_country_code
response = self.session.post(
self.api_url,
headers=api_headers_temp,
json=payload,
timeout=10,
verify=False,
)
if not response.ok:
log.info(
f"responded with status code: {response.status_code} (submit GitHub issue if this appears to be a bug)"
)
return jobs, new_cursor
data = response.json()
jobs = data["data"]["jobSearch"]["results"]
new_cursor = data["data"]["jobSearch"]["pageInfo"]["nextCursor"]
job_list = []
for job in jobs:
processed_job = self._process_job(job["job"])
if processed_job:
job_list.append(processed_job)
return job_list, new_cursor
def _build_filters(self):
"""
Builds the filters dict for job type/is_remote. If hours_old is provided, composite filter for job_type/is_remote is not possible.
IndeedApply: filters: { keyword: { field: "indeedApplyScope", keys: ["DESKTOP"] } }
"""
filters_str = ""
if self.scraper_input.hours_old:
filters_str = """
filters: {{
date: {{
field: "dateOnIndeed",
start: "{start}h"
}}
}}
""".format(
start=self.scraper_input.hours_old
)
elif self.scraper_input.easy_apply:
filters_str = """
filters: {
keyword: {
field: "indeedApplyScope",
keys: ["DESKTOP"]
}
}
"""
elif self.scraper_input.job_type or self.scraper_input.is_remote:
job_type_key_mapping = {
JobType.FULL_TIME: "CF3CP",
JobType.PART_TIME: "75GKK",
JobType.CONTRACT: "NJXCK",
JobType.INTERNSHIP: "VDTG7",
}
keys = []
if self.scraper_input.job_type:
key = job_type_key_mapping[self.scraper_input.job_type]
keys.append(key)
if self.scraper_input.is_remote:
keys.append("DSQF7")
if keys:
keys_str = '", "'.join(keys)
filters_str = f"""
filters: {{
composite: {{
filters: [{{
keyword: {{
field: "attributes",
keys: ["{keys_str}"]
}}
}}]
}}
}}
"""
return filters_str
def _process_job(self, job: dict) -> JobPost | None:
"""
Parses the job dict into JobPost model
:param job: dict to parse
:return: JobPost if it's a new job
"""
job_url = f'{self.base_url}/viewjob?jk={job["key"]}'
if job_url in self.seen_urls:
return
self.seen_urls.add(job_url)
description = job["description"]["html"]
if self.scraper_input.description_format == DescriptionFormat.MARKDOWN:
description = markdown_converter(description)
job_type = get_job_type(job["attributes"])
timestamp_seconds = job["datePublished"] / 1000
date_posted = datetime.fromtimestamp(timestamp_seconds).strftime("%Y-%m-%d")
employer = job["employer"].get("dossier") if job["employer"] else None
employer_details = employer.get("employerDetails", {}) if employer else {}
rel_url = job["employer"]["relativeCompanyPageUrl"] if job["employer"] else None
return JobPost(
id=f'in-{job["key"]}',
title=job["title"],
description=description,
company_name=job["employer"].get("name") if job.get("employer") else None,
company_url=(f"{self.base_url}{rel_url}" if job["employer"] else None),
company_url_direct=(
employer["links"].get("corporateWebsite")
if employer and employer.get("links")
else None
),
location=Location(
city=job.get("location", {}).get("city"),
state=job.get("location", {}).get("admin1Code"),
country=job.get("location", {}).get("countryCode"),
),
job_type=job_type,
compensation=get_compensation(job["compensation"]),
date_posted=date_posted,
job_url=job_url,
job_url_direct=(
job["recruit"].get("viewJobUrl") if job.get("recruit") else None
),
emails=extract_emails_from_text(description) if description else None,
is_remote=is_job_remote(job, description),
company_addresses=(
employer_details["addresses"][0]
if employer_details.get("addresses")
else None
),
company_industry=(
employer_details["industry"]
.replace("Iv1", "")
.replace("_", " ")
.title()
.strip()
if employer_details.get("industry")
else None
),
company_num_employees=employer_details.get("employeesLocalizedLabel"),
company_revenue=employer_details.get("revenueLocalizedLabel"),
company_description=employer_details.get("briefDescription"),
company_logo=(
employer["images"].get("squareLogoUrl")
if employer and employer.get("images")
else None
),
)
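
Review note: the filter precedence in `_build_filters` can be sketched in isolation. This is a minimal sketch, not the scraper's own code: `pick_filter_kind` and `build_attribute_filter` are hypothetical helper names, and the attribute keys are copied from the mapping in the diff above.

```python
def pick_filter_kind(hours_old=None, easy_apply=False, job_type_key=None, is_remote=False):
    """Mirror the if/elif chain: only the first matching filter kind is used."""
    if hours_old:
        return "date"
    if easy_apply:
        return "indeedApply"
    if job_type_key or is_remote:
        return "attributes"
    return "none"


def build_attribute_filter(keys):
    """Return a GraphQL `filters` fragment for the given attribute keys."""
    if not keys:
        return ""
    # Join so that the surrounding quotes in the template close each key.
    keys_str = '", "'.join(keys)
    return f"""
    filters: {{
      composite: {{
        filters: [{{
          keyword: {{
            field: "attributes",
            keys: ["{keys_str}"]
          }}
        }}]
      }}
    }}
    """


# Full-time ("CF3CP") combined with remote ("DSQF7"), as in the diff's mapping.
fragment = build_attribute_filter(["CF3CP", "DSQF7"])
print(fragment)
```

Because the chain is exclusive, a caller that passes both `hours_old` and `job_type` only gets the date filter; the job type is silently dropped.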

109
jobspy/indeed/constant.py Normal file

@ -0,0 +1,109 @@
job_search_query = """
query GetJobData {{
jobSearch(
{what}
{location}
limit: 100
{cursor}
sort: RELEVANCE
{filters}
) {{
pageInfo {{
nextCursor
}}
results {{
trackingKey
job {{
source {{
name
}}
key
title
datePublished
dateOnIndeed
description {{
html
}}
location {{
countryName
countryCode
admin1Code
city
postalCode
streetAddress
formatted {{
short
long
}}
}}
compensation {{
estimated {{
currencyCode
baseSalary {{
unitOfWork
range {{
... on Range {{
min
max
}}
}}
}}
}}
baseSalary {{
unitOfWork
range {{
... on Range {{
min
max
}}
}}
}}
currencyCode
}}
attributes {{
key
label
}}
employer {{
relativeCompanyPageUrl
name
dossier {{
employerDetails {{
addresses
industry
employeesLocalizedLabel
revenueLocalizedLabel
briefDescription
ceoName
ceoPhotoUrl
}}
images {{
headerImageUrl
squareLogoUrl
}}
links {{
corporateWebsite
}}
}}
}}
recruit {{
viewJobUrl
detailedSalary
workSchedule
}}
}}
}}
}}
}}
"""
api_headers = {
"Host": "apis.indeed.com",
"content-type": "application/json",
"indeed-api-key": "161092c2017b5bbab13edb12461a62d5a833871e7cad6d9d475304573de67ac8",
"accept": "application/json",
"indeed-locale": "en-US",
"accept-language": "en-US,en;q=0.9",
"user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_6_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 Indeed App 193.1",
"indeed-app-info": "appv=193.1; appid=com.indeed.jobsearch; osv=16.6.1; os=ios; dtype=phone",
}
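
Review note: the doubled braces in `job_search_query` exist because the template is rendered with `str.format`, where `{{` and `}}` produce literal braces and single-brace names are substituted. The sketch below uses a shortened template; `render_query` and the exact `what:`/`cursor:` argument formats are assumptions for illustration, since the call site is outside this part of the diff.

```python
# Shortened stand-in for job_search_query; same escaping rules apply.
query_template = """
query GetJobData {{
  jobSearch(
    {what}
    limit: 100
    {cursor}
  ) {{
    pageInfo {{
      nextCursor
    }}
  }}
}}
"""


def render_query(search_term, cursor=None):
    """Fill the placeholders; empty strings drop an argument entirely."""
    what = f'what: "{search_term}"' if search_term else ""
    cursor_arg = f'cursor: "{cursor}"' if cursor else ""
    return query_template.format(what=what, cursor=cursor_arg)


rendered = render_query("python developer")
print(rendered)
```

A search term containing a double quote would break the rendered query, so a real caller would likely need to escape it first.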

83
jobspy/indeed/util.py Normal file

@ -0,0 +1,83 @@
from jobspy.model import CompensationInterval, JobType, Compensation
from jobspy.util import get_enum_from_job_type
def get_job_type(attributes: list) -> list[JobType]:
"""
Parses the attributes to get list of job types
:param attributes:
:return: list of JobType
"""
job_types: list[JobType] = []
for attribute in attributes:
job_type_str = attribute["label"].replace("-", "").replace(" ", "").lower()
job_type = get_enum_from_job_type(job_type_str)
if job_type:
job_types.append(job_type)
return job_types
def get_compensation(compensation: dict) -> Compensation | None:
"""
Parses the job to get compensation
:param compensation:
:return: compensation object
"""
if not compensation["baseSalary"] and not compensation["estimated"]:
return None
comp = (
compensation["baseSalary"]
if compensation["baseSalary"]
else compensation["estimated"]["baseSalary"]
)
if not comp:
return None
interval = get_compensation_interval(comp["unitOfWork"])
if not interval:
return None
min_range = comp["range"].get("min")
max_range = comp["range"].get("max")
return Compensation(
interval=interval,
min_amount=int(min_range) if min_range is not None else None,
max_amount=int(max_range) if max_range is not None else None,
currency=(
compensation["estimated"]["currencyCode"]
if compensation["estimated"]
else compensation["currencyCode"]
),
)
def is_job_remote(job: dict, description: str) -> bool:
"""
Searches the description, location, and attributes to check if job is remote
"""
remote_keywords = ["remote", "work from home", "wfh"]
is_remote_in_attributes = any(
any(keyword in attr["label"].lower() for keyword in remote_keywords)
for attr in job["attributes"]
)
is_remote_in_description = any(
keyword in description.lower() for keyword in remote_keywords
)
is_remote_in_location = any(
keyword in job["location"]["formatted"]["long"].lower()
for keyword in remote_keywords
)
return is_remote_in_attributes or is_remote_in_description or is_remote_in_location
def get_compensation_interval(interval: str) -> CompensationInterval:
interval_mapping = {
"DAY": "DAILY",
"YEAR": "YEARLY",
"HOUR": "HOURLY",
"WEEK": "WEEKLY",
"MONTH": "MONTHLY",
}
mapped_interval = interval_mapping.get(interval.upper(), None)
if mapped_interval and mapped_interval in CompensationInterval.__members__:
return CompensationInterval[mapped_interval]
else:
raise ValueError(f"Unsupported interval: {interval}")
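
Review note: a small sketch of how `get_compensation` chooses its salary source and maps the unit. `Interval` is a hypothetical stand-in for `CompensationInterval` (whose members are not shown in this diff), and the sample payload is invented to match the field names in the GraphQL query.

```python
from enum import Enum
from typing import Optional


class Interval(Enum):  # hypothetical stand-in for CompensationInterval
    YEARLY = "yearly"
    MONTHLY = "monthly"
    WEEKLY = "weekly"
    DAILY = "daily"
    HOURLY = "hourly"


UNIT_TO_INTERVAL = {
    "DAY": "DAILY",
    "YEAR": "YEARLY",
    "HOUR": "HOURLY",
    "WEEK": "WEEKLY",
    "MONTH": "MONTHLY",
}


def to_interval(unit: str) -> Interval:
    """Map an API unitOfWork value to an interval, raising on unknown units."""
    name = UNIT_TO_INTERVAL.get(unit.upper())
    if name is None or name not in Interval.__members__:
        raise ValueError(f"Unsupported interval: {unit}")
    return Interval[name]


def pick_base_salary(compensation: dict) -> Optional[dict]:
    """Prefer the posted baseSalary; fall back to the estimated one."""
    direct = compensation.get("baseSalary")
    if direct:
        return direct
    estimated = compensation.get("estimated") or {}
    return estimated.get("baseSalary")


# Invented payload: no posted salary, only an estimate.
sample = {
    "baseSalary": None,
    "currencyCode": None,
    "estimated": {
        "currencyCode": "USD",
        "baseSalary": {"unitOfWork": "YEAR", "range": {"min": 90000.0, "max": 120000.0}},
    },
}
base = pick_base_salary(sample)
interval = to_interval(base["unitOfWork"])
print(interval, base["range"])
```

Note that `get_compensation_interval` raises on an unknown unit rather than returning `None`, so the `if not interval` check in `get_compensation` is never reached for unsupported units.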


@ -1,56 +1,70 @@
"""
jobspy.scrapers.linkedin
~~~~~~~~~~~~~~~~~~~
This module contains routines to scrape LinkedIn.
"""
from __future__ import annotations
import time
import math
import random
import regex as re
import urllib.parse
from typing import Optional
import time
from datetime import datetime
from typing import Optional
from urllib.parse import urlparse, urlunparse, unquote
from threading import Lock
from bs4.element import Tag
import regex as re
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urlunparse
from bs4.element import Tag
from .. import Scraper, ScraperInput, Site
from ..exceptions import LinkedInException
from ..utils import create_session
from ...jobs import (
from jobspy.exception import LinkedInException
from jobspy.linkedin.constant import headers
from jobspy.linkedin.util import (
is_job_remote,
job_type_code,
parse_job_type,
parse_job_level,
parse_company_industry
)
from jobspy.model import (
JobPost,
Location,
JobResponse,
JobType,
Country,
Compensation,
DescriptionFormat,
Scraper,
ScraperInput,
Site,
)
from ..utils import (
logger,
from jobspy.util import (
extract_emails_from_text,
get_enum_from_job_type,
currency_parser,
markdown_converter,
create_session,
remove_attributes,
create_logger,
)
log = create_logger("LinkedIn")
class LinkedInScraper(Scraper):
class LinkedIn(Scraper):
base_url = "https://www.linkedin.com"
delay = 3
band_delay = 4
jobs_per_page = 25
def __init__(self, proxy: Optional[str] = None):
def __init__(
self, proxies: list[str] | str | None = None, ca_cert: str | None = None
):
"""
Initializes LinkedInScraper with the LinkedIn job search url
"""
super().__init__(Site(Site.LINKEDIN), proxy=proxy)
super().__init__(Site.LINKEDIN, proxies=proxies, ca_cert=ca_cert)
self.session = create_session(
proxies=self.proxies,
ca_cert=ca_cert,
is_tls=False,
has_retry=True,
delay=5,
clear_cookies=True,
)
self.session.headers.update(headers)
self.scraper_input = None
self.country = "worldwide"
self.job_url_direct_regex = re.compile(r'(?<=\?url=)[^"]+')
@ -63,30 +77,32 @@ class LinkedInScraper(Scraper):
"""
self.scraper_input = scraper_input
job_list: list[JobPost] = []
seen_urls = set()
url_lock = Lock()
page = scraper_input.offset // 25 + 25 if scraper_input.offset else 0
seen_ids = set()
start = scraper_input.offset // 10 * 10 if scraper_input.offset else 0
request_count = 0
seconds_old = (
scraper_input.hours_old * 3600 if scraper_input.hours_old else None
)
continue_search = (
lambda: len(job_list) < scraper_input.results_wanted and page < 1000
lambda: len(job_list) < scraper_input.results_wanted and start < 1000
)
while continue_search():
logger.info(f"LinkedIn search page: {page // 25 + 1}")
session = create_session(is_tls=False, has_retry=True, delay=5)
request_count += 1
log.info(
f"search page: {request_count} / {math.ceil(scraper_input.results_wanted / 10)}"
)
params = {
"keywords": scraper_input.search_term,
"location": scraper_input.location,
"distance": scraper_input.distance,
"f_WT": 2 if scraper_input.is_remote else None,
"f_JT": (
self.job_type_code(scraper_input.job_type)
job_type_code(scraper_input.job_type)
if scraper_input.job_type
else None
),
"pageNum": 0,
"start": page + scraper_input.offset,
"start": start,
"f_AL": "true" if scraper_input.easy_apply else None,
"f_C": (
",".join(map(str, scraper_input.linkedin_company_ids))
@ -99,12 +115,9 @@ class LinkedInScraper(Scraper):
params = {k: v for k, v in params.items() if v is not None}
try:
response = session.get(
response = self.session.get(
f"{self.base_url}/jobs-guest/jobs/api/seeMoreJobPostings/search?",
params=params,
allow_redirects=True,
proxies=self.proxy,
headers=self.headers,
timeout=10,
)
if response.status_code not in range(200, 400):
@ -115,13 +128,13 @@ class LinkedInScraper(Scraper):
else:
err = f"LinkedIn response status code {response.status_code}"
err += f" - {response.text}"
logger.error(err)
log.error(err)
return JobResponse(jobs=job_list)
except Exception as e:
if "Proxy responded with" in str(e):
logger.error(f"LinkedIn: Bad proxy")
log.error(f"LinkedIn: Bad proxy")
else:
logger.error(f"LinkedIn: {str(e)}")
log.error(f"LinkedIn: {str(e)}")
return JobResponse(jobs=job_list)
soup = BeautifulSoup(response.text, "html.parser")
@ -130,20 +143,18 @@ class LinkedInScraper(Scraper):
return JobResponse(jobs=job_list)
for job_card in job_cards:
job_url = None
href_tag = job_card.find("a", class_="base-card__full-link")
if href_tag and "href" in href_tag.attrs:
href = href_tag.attrs["href"].split("?")[0]
job_id = href.split("-")[-1]
job_url = f"{self.base_url}/jobs/view/{job_id}"
with url_lock:
if job_url in seen_urls:
if job_id in seen_ids:
continue
seen_urls.add(job_url)
seen_ids.add(job_id)
try:
fetch_desc = scraper_input.linkedin_fetch_description
job_post = self._process_job(job_card, job_url, fetch_desc)
job_post = self._process_job(job_card, job_id, fetch_desc)
if job_post:
job_list.append(job_post)
if not continue_search():
@ -153,17 +164,17 @@ class LinkedInScraper(Scraper):
if continue_search():
time.sleep(random.uniform(self.delay, self.delay + self.band_delay))
page += self.jobs_per_page
start += len(job_list)
job_list = job_list[: scraper_input.results_wanted]
return JobResponse(jobs=job_list)
def _process_job(
self, job_card: Tag, job_url: str, full_descr: bool
self, job_card: Tag, job_id: str, full_descr: bool
) -> Optional[JobPost]:
salary_tag = job_card.find("span", class_="job-search-card__salary-info")
compensation = None
compensation = description = None
if salary_tag:
salary_text = salary_tag.get_text(separator=" ").strip()
salary_values = [currency_parser(value) for value in salary_text.split("-")]
@ -206,49 +217,44 @@ class LinkedInScraper(Scraper):
date_posted = None
job_details = {}
if full_descr:
job_details = self._get_job_details(job_url)
job_details = self._get_job_details(job_id)
description = job_details.get("description")
is_remote = is_job_remote(title, description, location)
return JobPost(
id=self._get_id(job_url),
id=f"li-{job_id}",
title=title,
company_name=company,
company_url=company_url,
location=location,
is_remote=is_remote,
date_posted=date_posted,
job_url=job_url,
job_url=f"{self.base_url}/jobs/view/{job_id}",
compensation=compensation,
job_type=job_details.get("job_type"),
job_level=(job_details.get("job_level") or "").lower(),
company_industry=job_details.get("company_industry"),
description=job_details.get("description"),
job_url_direct=job_details.get("job_url_direct"),
emails=extract_emails_from_text(job_details.get("description")),
logo_photo_url=job_details.get("logo_photo_url"),
emails=extract_emails_from_text(description),
company_logo=job_details.get("company_logo"),
job_function=job_details.get("job_function"),
)
def _get_id(self, url: str):
"""
Extracts the job id from the job url
:param url:
:return: str
"""
if not url:
return None
return url.split("/")[-1]
def _get_job_details(self, job_page_url: str) -> dict:
def _get_job_details(self, job_id: str) -> dict:
"""
Retrieves job description and other job details by fetching the job page for the given job id
:param job_id:
:return: dict
"""
try:
session = create_session(is_tls=False, has_retry=True)
response = session.get(
job_page_url, headers=self.headers, timeout=5, proxies=self.proxy
response = self.session.get(
f"{self.base_url}/jobs/view/{job_id}", timeout=5
)
response.raise_for_status()
except Exception:
return {}
if response.url == "https://www.linkedin.com/signup":
if "linkedin.com/signup" in response.url:
return {}
soup = BeautifulSoup(response.text, "html.parser")
@ -257,23 +263,36 @@ class LinkedInScraper(Scraper):
)
description = None
if div_content is not None:
def remove_attributes(tag):
for attr in list(tag.attrs):
del tag[attr]
return tag
div_content = remove_attributes(div_content)
description = div_content.prettify(formatter="html")
if self.scraper_input.description_format == DescriptionFormat.MARKDOWN:
description = markdown_converter(description)
h3_tag = soup.find(
"h3", string=lambda text: text and "Job function" in text.strip()
)
job_function = None
if h3_tag:
job_function_span = h3_tag.find_next(
"span", class_="description__job-criteria-text"
)
if job_function_span:
job_function = job_function_span.text.strip()
company_logo = (
logo_image.get("data-delayed-url")
if (logo_image := soup.find("img", {"class": "artdeco-entity-image"}))
else None
)
return {
"description": description,
"job_type": self._parse_job_type(soup),
"job_level": parse_job_level(soup),
"company_industry": parse_company_industry(soup),
"job_type": parse_job_type(soup),
"job_url_direct": self._parse_job_url_direct(soup),
"logo_photo_url": soup.find("img", {"class": "artdeco-entity-image"}).get(
"data-delayed-url"
),
"company_logo": company_logo,
"job_function": job_function,
}
def _get_location(self, metadata_card: Optional[Tag]) -> Location:
@ -302,31 +321,6 @@ class LinkedInScraper(Scraper):
location = Location(city=city, state=state, country=country)
return location
@staticmethod
def _parse_job_type(soup_job_type: BeautifulSoup) -> list[JobType] | None:
"""
Gets the job type from job page
:param soup_job_type:
:return: JobType
"""
h3_tag = soup_job_type.find(
"h3",
class_="description__job-criteria-subheader",
string=lambda text: "Employment type" in text,
)
employment_type = None
if h3_tag:
employment_type_span = h3_tag.find_next_sibling(
"span",
class_="description__job-criteria-text description__job-criteria-text--criteria",
)
if employment_type_span:
employment_type = employment_type_span.get_text(strip=True)
employment_type = employment_type.lower()
employment_type = employment_type.replace("-", "")
return [get_enum_from_job_type(employment_type)] if employment_type else []
def _parse_job_url_direct(self, soup: BeautifulSoup) -> str | None:
"""
Gets the job url direct from job page
@ -340,25 +334,6 @@ class LinkedInScraper(Scraper):
job_url_direct_content.decode_contents().strip()
)
if job_url_direct_match:
job_url_direct = urllib.parse.unquote(job_url_direct_match.group())
job_url_direct = unquote(job_url_direct_match.group())
return job_url_direct
@staticmethod
def job_type_code(job_type_enum: JobType) -> str:
return {
JobType.FULL_TIME: "F",
JobType.PART_TIME: "P",
JobType.INTERNSHIP: "I",
JobType.CONTRACT: "C",
JobType.TEMPORARY: "T",
}.get(job_type_enum, "")
headers = {
"authority": "www.linkedin.com",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"accept-language": "en-US,en;q=0.9",
"cache-control": "max-age=0",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
}


@ -0,0 +1,8 @@
headers = {
"authority": "www.linkedin.com",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"accept-language": "en-US,en;q=0.9",
"cache-control": "max-age=0",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
}

96
jobspy/linkedin/util.py Normal file

@ -0,0 +1,96 @@
from bs4 import BeautifulSoup
from jobspy.model import JobType, Location
from jobspy.util import get_enum_from_job_type
def job_type_code(job_type_enum: JobType) -> str:
return {
JobType.FULL_TIME: "F",
JobType.PART_TIME: "P",
JobType.INTERNSHIP: "I",
JobType.CONTRACT: "C",
JobType.TEMPORARY: "T",
}.get(job_type_enum, "")
def parse_job_type(soup_job_type: BeautifulSoup) -> list[JobType] | None:
"""
Gets the job type from job page
:param soup_job_type:
:return: JobType
"""
h3_tag = soup_job_type.find(
"h3",
class_="description__job-criteria-subheader",
string=lambda text: text is not None and "Employment type" in text,
)
employment_type = None
if h3_tag:
employment_type_span = h3_tag.find_next_sibling(
"span",
class_="description__job-criteria-text description__job-criteria-text--criteria",
)
if employment_type_span:
employment_type = employment_type_span.get_text(strip=True)
employment_type = employment_type.lower()
employment_type = employment_type.replace("-", "")
return [get_enum_from_job_type(employment_type)] if employment_type else []
def parse_job_level(soup_job_level: BeautifulSoup) -> str | None:
"""
Gets the job level from job page
:param soup_job_level:
:return: str
"""
h3_tag = soup_job_level.find(
"h3",
class_="description__job-criteria-subheader",
string=lambda text: text is not None and "Seniority level" in text,
)
job_level = None
if h3_tag:
job_level_span = h3_tag.find_next_sibling(
"span",
class_="description__job-criteria-text description__job-criteria-text--criteria",
)
if job_level_span:
job_level = job_level_span.get_text(strip=True)
return job_level
def parse_company_industry(soup_industry: BeautifulSoup) -> str | None:
"""
Gets the company industry from job page
:param soup_industry:
:return: str
"""
h3_tag = soup_industry.find(
"h3",
class_="description__job-criteria-subheader",
string=lambda text: text is not None and "Industries" in text,
)
industry = None
if h3_tag:
industry_span = h3_tag.find_next_sibling(
"span",
class_="description__job-criteria-text description__job-criteria-text--criteria",
)
if industry_span:
industry = industry_span.get_text(strip=True)
return industry
def is_job_remote(title: str, description: str, location: Location) -> bool:
"""
Searches the title, location, and description to check if job is remote
"""
remote_keywords = ["remote", "work from home", "wfh"]
location = location.display_location()
full_string = f'{title} {description} {location}'.lower()
is_remote = any(keyword in full_string for keyword in remote_keywords)
return is_remote
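
Review note: the keyword check in `is_job_remote` is a plain substring match over the combined title, description, and location. The sketch below shows the same idea with `None`-safe inputs; `looks_remote` is a hypothetical name, and the location is passed as an already-formatted string instead of a `Location` object.

```python
REMOTE_KEYWORDS = ("remote", "work from home", "wfh")


def looks_remote(title, description, location_text):
    """Return True if any remote keyword appears in any of the three fields."""
    # Treat missing fields as empty so None is never formatted into the text.
    parts = (title or "", description or "", location_text or "")
    haystack = " ".join(parts).lower()
    return any(keyword in haystack for keyword in REMOTE_KEYWORDS)


print(looks_remote("Backend Engineer", None, "Austin, TX"))
print(looks_remote("Engineer (WFH)", None, ""))
```

Substring matching is deliberately loose: a description saying "this is not a remote role" would still be flagged, so the result is best read as a hint and not a guarantee.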


@ -1,5 +1,6 @@
from __future__ import annotations
from abc import ABC, abstractmethod
from typing import Optional
from datetime import date
from enum import Enum
@ -68,16 +69,20 @@ class Country(Enum):
AUSTRIA = ("austria", "at", "at")
BAHRAIN = ("bahrain", "bh")
BELGIUM = ("belgium", "be", "fr:be")
BULGARIA = ("bulgaria", "bg")
BRAZIL = ("brazil", "br", "com.br")
CANADA = ("canada", "ca", "ca")
CHILE = ("chile", "cl")
CHINA = ("china", "cn")
COLOMBIA = ("colombia", "co")
COSTARICA = ("costa rica", "cr")
CROATIA = ("croatia", "hr")
CYPRUS = ("cyprus", "cy")
CZECHREPUBLIC = ("czech republic,czechia", "cz")
DENMARK = ("denmark", "dk")
ECUADOR = ("ecuador", "ec")
EGYPT = ("egypt", "eg")
ESTONIA = ("estonia", "ee")
FINLAND = ("finland", "fi")
FRANCE = ("france", "fr", "fr")
GERMANY = ("germany", "de", "de")
@ -91,8 +96,11 @@ class Country(Enum):
ITALY = ("italy", "it", "it")
JAPAN = ("japan", "jp")
KUWAIT = ("kuwait", "kw")
LATVIA = ("latvia", "lv")
LITHUANIA = ("lithuania", "lt")
LUXEMBOURG = ("luxembourg", "lu")
MALAYSIA = ("malaysia", "malaysia")
MALAYSIA = ("malaysia", "malaysia:my", "com")
MALTA = ("malta", "malta:mt", "mt")
MEXICO = ("mexico", "mx", "com.mx")
MOROCCO = ("morocco", "ma")
NETHERLANDS = ("netherlands", "nl", "nl")
@ -110,6 +118,8 @@ class Country(Enum):
ROMANIA = ("romania", "ro")
SAUDIARABIA = ("saudi arabia", "sa")
SINGAPORE = ("singapore", "sg", "sg")
SLOVAKIA = ("slovakia", "sk")
SLOVENIA = ("slovenia", "si")
SOUTHAFRICA = ("south africa", "za")
SOUTHKOREA = ("south korea", "kr")
SPAIN = ("spain", "es", "es")
@ -117,7 +127,7 @@ class Country(Enum):
SWITZERLAND = ("switzerland", "ch", "de:ch")
TAIWAN = ("taiwan", "tw")
THAILAND = ("thailand", "th")
TURKEY = ("turkey", "tr")
TURKEY = ("türkiye,turkey", "tr")
UKRAINE = ("ukraine", "ua")
UNITEDARABEMIRATES = ("united arab emirates", "ae")
UK = ("uk,united kingdom", "uk:gb", "co.uk")
@ -242,18 +252,79 @@ class JobPost(BaseModel):
date_posted: date | None = None
emails: list[str] | None = None
is_remote: bool | None = None
listing_type: str | None = None
# indeed specific
company_addresses: str | None = None
# LinkedIn specific
job_level: str | None = None
# LinkedIn and Indeed specific
company_industry: str | None = None
# Indeed specific
company_addresses: str | None = None
company_num_employees: str | None = None
company_revenue: str | None = None
company_description: str | None = None
ceo_name: str | None = None
ceo_photo_url: str | None = None
logo_photo_url: str | None = None
company_logo: str | None = None
banner_photo_url: str | None = None
# LinkedIn only atm
job_function: str | None = None
# Naukri specific
skills: list[str] | None = None  # from tagsAndSkills
experience_range: str | None = None  # from experienceText
company_rating: float | None = None  # from ambitionBoxData.AggregateRating
company_reviews_count: int | None = None  # from ambitionBoxData.ReviewsCount
vacancy_count: int | None = None  # from vacancy
work_from_home_type: str | None = None  # from clusters.wfhType (e.g., "Hybrid", "Remote")
class JobResponse(BaseModel):
jobs: list[JobPost] = []
class Site(Enum):
LINKEDIN = "linkedin"
INDEED = "indeed"
ZIP_RECRUITER = "zip_recruiter"
GLASSDOOR = "glassdoor"
GOOGLE = "google"
BAYT = "bayt"
NAUKRI = "naukri"
class SalarySource(Enum):
DIRECT_DATA = "direct_data"
DESCRIPTION = "description"
class ScraperInput(BaseModel):
site_type: list[Site]
search_term: str | None = None
google_search_term: str | None = None
location: str | None = None
country: Country | None = Country.USA
distance: int | None = None
is_remote: bool = False
job_type: JobType | None = None
easy_apply: bool | None = None
offset: int = 0
linkedin_fetch_description: bool = False
linkedin_company_ids: list[int] | None = None
description_format: DescriptionFormat | None = DescriptionFormat.MARKDOWN
results_wanted: int = 15
hours_old: int | None = None
class Scraper(ABC):
def __init__(
self, site: Site, proxies: list[str] | None = None, ca_cert: str | None = None
):
self.site = site
self.proxies = proxies
self.ca_cert = ca_cert
@abstractmethod
def scrape(self, scraper_input: ScraperInput) -> JobResponse: ...

301
jobspy/naukri/__init__.py Normal file

@ -0,0 +1,301 @@
from __future__ import annotations
import math
import random
import time
from datetime import datetime, date, timedelta
from typing import Optional
import regex as re
import requests
from jobspy.exception import NaukriException
from jobspy.naukri.constant import headers as naukri_headers
from jobspy.naukri.util import (
is_job_remote,
parse_job_type,
parse_company_industry,
)
from jobspy.model import (
JobPost,
Location,
JobResponse,
Country,
Compensation,
DescriptionFormat,
Scraper,
ScraperInput,
Site,
)
from jobspy.util import (
extract_emails_from_text,
currency_parser,
markdown_converter,
create_session,
create_logger,
)
log = create_logger("Naukri")
class Naukri(Scraper):
base_url = "https://www.naukri.com/jobapi/v3/search"
delay = 3
band_delay = 4
jobs_per_page = 20
def __init__(
self, proxies: list[str] | str | None = None, ca_cert: str | None = None
):
"""
Initializes the Naukri scraper with the Naukri API URL
"""
super().__init__(Site.NAUKRI, proxies=proxies, ca_cert=ca_cert)
self.session = create_session(
proxies=self.proxies,
ca_cert=ca_cert,
is_tls=False,
has_retry=True,
delay=5,
clear_cookies=True,
)
self.session.headers.update(naukri_headers)
self.scraper_input = None
self.country = "India"  # Naukri is India-focused by default
log.info("Naukri scraper initialized")
def scrape(self, scraper_input: ScraperInput) -> JobResponse:
"""
Scrapes Naukri API for jobs with scraper_input criteria
:param scraper_input:
:return: job_response
"""
self.scraper_input = scraper_input
job_list: list[JobPost] = []
seen_ids = set()
start = scraper_input.offset or 0
page = (start // self.jobs_per_page) + 1
request_count = 0
seconds_old = (
scraper_input.hours_old * 3600 if scraper_input.hours_old else None
)
continue_search = (
lambda: len(job_list) < scraper_input.results_wanted and page <= 50 # Arbitrary limit
)
while continue_search():
request_count += 1
log.info(
f"Scraping page {request_count} / {math.ceil(scraper_input.results_wanted / self.jobs_per_page)} "
f"for search term: {scraper_input.search_term}"
)
params = {
"noOfResults": self.jobs_per_page,
"urlType": "search_by_keyword",
"searchType": "adv",
"keyword": scraper_input.search_term,
"pageNo": page,
"k": scraper_input.search_term,
"seoKey": f"{scraper_input.search_term.lower().replace(' ', '-')}-jobs",
"src": "jobsearchDesk",
"latLong": "",
"location": scraper_input.location,
"remote": "true" if scraper_input.is_remote else None,
}
if seconds_old:
params["days"] = seconds_old // 86400 # Convert to days
params = {k: v for k, v in params.items() if v is not None}
try:
log.debug(f"Sending request to {self.base_url} with params: {params}")
response = self.session.get(self.base_url, params=params, timeout=10)
if response.status_code not in range(200, 400):
err = f"Naukri API response status code {response.status_code} - {response.text}"
log.error(err)
return JobResponse(jobs=job_list)
data = response.json()
job_details = data.get("jobDetails", [])
log.info(f"Received {len(job_details)} job entries from API")
if not job_details:
log.warning("No job details found in API response")
break
except Exception as e:
log.error(f"Naukri API request failed: {str(e)}")
return JobResponse(jobs=job_list)
for job in job_details:
job_id = job.get("jobId")
if not job_id or job_id in seen_ids:
continue
seen_ids.add(job_id)
log.debug(f"Processing job ID: {job_id}")
try:
fetch_desc = scraper_input.linkedin_fetch_description
job_post = self._process_job(job, job_id, fetch_desc)
if job_post:
job_list.append(job_post)
log.info(f"Added job: {job_post.title} (ID: {job_id})")
if not continue_search():
break
except Exception as e:
log.error(f"Error processing job ID {job_id}: {str(e)}")
raise NaukriException(str(e))
if continue_search():
time.sleep(random.uniform(self.delay, self.delay + self.band_delay))
page += 1
job_list = job_list[:scraper_input.results_wanted]
log.info(f"Scraping completed. Total jobs collected: {len(job_list)}")
return JobResponse(jobs=job_list)
def _process_job(
self, job: dict, job_id: str, full_descr: bool
) -> Optional[JobPost]:
"""
Processes a single job from API response into a JobPost object
"""
title = job.get("title", "N/A")
company = job.get("companyName", "N/A")
company_url = f"https://www.naukri.com/{job.get('staticUrl', '')}" if job.get("staticUrl") else None
location = self._get_location(job.get("placeholders", []))
compensation = self._get_compensation(job.get("placeholders", []))
date_posted = self._parse_date(job.get("footerPlaceholderLabel"), job.get("createdDate"))
job_url = f"https://www.naukri.com{job.get('jdURL', f'/job/{job_id}')}"
description = job.get("jobDescription") if full_descr else None
if description and self.scraper_input.description_format == DescriptionFormat.MARKDOWN:
description = markdown_converter(description)
job_type = parse_job_type(description) if description else None
company_industry = parse_company_industry(description) if description else None
is_remote = is_job_remote(title, description or "", location)
company_logo = job.get("logoPathV3") or job.get("logoPath")
# Naukri-specific fields
skills = job.get("tagsAndSkills", "").split(",") if job.get("tagsAndSkills") else None
experience_range = job.get("experienceText")
ambition_box = job.get("ambitionBoxData") or {}
company_rating = float(ambition_box.get("AggregateRating")) if ambition_box.get("AggregateRating") else None
company_reviews_count = ambition_box.get("ReviewsCount")
vacancy_count = job.get("vacancy")
work_from_home_type = self._infer_work_from_home_type(job.get("placeholders", []), title, description or "")
job_post = JobPost(
id=f"nk-{job_id}",
title=title,
company_name=company,
company_url=company_url,
location=location,
is_remote=is_remote,
date_posted=date_posted,
job_url=job_url,
compensation=compensation,
job_type=job_type,
company_industry=company_industry,
description=description,
emails=extract_emails_from_text(description or ""),
company_logo=company_logo,
skills=skills,
experience_range=experience_range,
company_rating=company_rating,
company_reviews_count=company_reviews_count,
vacancy_count=vacancy_count,
work_from_home_type=work_from_home_type,
)
log.debug(f"Processed job: {title} at {company}")
return job_post
def _get_location(self, placeholders: list[dict]) -> Location:
"""
Extracts location data from placeholders
"""
location = Location(country=Country.INDIA)
for placeholder in placeholders:
if placeholder.get("type") == "location":
location_str = placeholder.get("label", "")
parts = location_str.split(", ")
city = parts[0] if parts else None
state = parts[1] if len(parts) > 1 else None
location = Location(city=city, state=state, country=Country.INDIA)
log.debug(f"Parsed location: {location.display_location()}")
break
return location
def _get_compensation(self, placeholders: list[dict]) -> Optional[Compensation]:
"""
Extracts compensation data from placeholders, handling Indian salary formats (Lakhs, Crores)
"""
for placeholder in placeholders:
if placeholder.get("type") == "salary":
salary_text = placeholder.get("label", "").strip()
if salary_text == "Not disclosed":
log.debug("Salary not disclosed")
return None
# Handle Indian salary formats (e.g., "12-16 Lacs P.A.", "1-5 Cr")
salary_match = re.match(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*(Lacs|Lakh|Cr)\s*(P\.A\.)?", salary_text, re.IGNORECASE)
if salary_match:
min_salary, max_salary, unit = salary_match.groups()[:3]
min_salary, max_salary = float(min_salary), float(max_salary)
currency = "INR"
# Convert to base units (INR)
if unit.lower() in ("lacs", "lakh"):
min_salary *= 100000 # 1 Lakh = 100,000 INR
max_salary *= 100000
elif unit.lower() == "cr":
min_salary *= 10000000 # 1 Crore = 10,000,000 INR
max_salary *= 10000000
log.debug(f"Parsed salary: {min_salary} - {max_salary} INR")
return Compensation(
min_amount=int(min_salary),
max_amount=int(max_salary),
currency=currency,
)
else:
log.debug(f"Could not parse salary: {salary_text}")
return None
return None

    def _parse_date(self, label: str, created_date: int) -> Optional[date]:
        """
        Parses date from footerPlaceholderLabel or createdDate, returning a date object
        """
        today = datetime.now()
        if not label:
            if created_date:
                # createdDate is a millisecond epoch timestamp
                return datetime.fromtimestamp(created_date / 1000).date()
            return None
        label = label.lower()
        if "today" in label or "just now" in label or "few hours" in label:
            log.debug("Date parsed as today")
            return today.date()
        elif "ago" in label:
            match = re.search(r"(\d+)\s*day", label)
            if match:
                days = int(match.group(1))
                parsed_date = (today - timedelta(days=days)).date()
                log.debug(f"Date parsed: {days} days ago -> {parsed_date}")
                return parsed_date
        elif created_date:
            parsed_date = datetime.fromtimestamp(created_date / 1000).date()
            log.debug(f"Date parsed from timestamp: {parsed_date}")
            return parsed_date
        log.debug("No date parsed")
        return None

    def _infer_work_from_home_type(self, placeholders: list[dict], title: str, description: str) -> Optional[str]:
        """
        Infers work-from-home type from job data (e.g., 'Hybrid', 'Remote', 'Work from office')
        """
        # Use .get() so a placeholder without "type" or "label" cannot raise KeyError
        location_str = next(
            ((p.get("label") or "") for p in placeholders if p.get("type") == "location"),
            "",
        ).lower()
        title = (title or "").lower()
        description = (description or "").lower()
        if "hybrid" in location_str or "hybrid" in title or "hybrid" in description:
            return "Hybrid"
        if "remote" in location_str or "remote" in title or "remote" in description:
            return "Remote"
        # Neither keyword matched anywhere, so the listing is treated as on-site.
        # (The original final branch was always true here, which left its
        # trailing `return None` unreachable.)
        return "Work from office"
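
The lakh/crore conversion in `_get_compensation` above can be tried in isolation. The following is a minimal sketch, not the scraper's own code: `parse_inr_range` and `UNIT_MULTIPLIERS` are hypothetical names introduced here, and the function returns a plain tuple instead of a `Compensation` object so that it runs without `jobspy` installed.

```python
import re
from typing import Optional, Tuple

# Hypothetical lookup table; the scraper itself uses an if/elif chain.
UNIT_MULTIPLIERS = {"lacs": 100_000, "lakh": 100_000, "cr": 10_000_000}

# Same pattern as in _get_compensation: "<min>-<max> <unit> [P.A.]"
SALARY_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*(Lacs|Lakh|Cr)\s*(P\.A\.)?",
    re.IGNORECASE,
)


def parse_inr_range(label: str) -> Optional[Tuple[int, int]]:
    """Return (min, max) in rupees, or None if the label is not a range."""
    match = SALARY_RE.match(label.strip())
    if not match:
        return None
    low, high, unit = match.groups()[:3]
    factor = UNIT_MULTIPLIERS[unit.lower()]
    return int(float(low) * factor), int(float(high) * factor)


print(parse_inr_range("12-16 Lacs P.A."))
print(parse_inr_range("Not disclosed"))
```

Labels that do not follow the `min-max unit` shape (for example a single figure) fall through to `None`, which matches the "Could not parse salary" branch above.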

11
jobspy/naukri/constant.py Normal file

@ -0,0 +1,11 @@
headers = {
"authority": "www.naukri.com",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"accept-language": "en-US,en;q=0.9",
"cache-control": "max-age=0",
"upgrade-insecure-requests": "1",
"appid": "109",
"systemid": "Naukri",
"Nkparam": "Ppy0YK9uSHqPtG3bEejYc04RTpUN2CjJOrqA68tzQt0SKJHXZKzz9M8cZtKLVkoOuQmfe4cTb1r2CwfHaxW5Tg==",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
}

34
jobspy/naukri/util.py Normal file

@ -0,0 +1,34 @@
from __future__ import annotations
from bs4 import BeautifulSoup
from jobspy.model import JobType, Location
from jobspy.util import get_enum_from_job_type


def parse_job_type(soup: BeautifulSoup) -> list[JobType] | None:
    """
    Gets the job type from the job page
    """
    job_type_tag = soup.find("span", class_="job-type")
    if job_type_tag:
        job_type_str = job_type_tag.get_text(strip=True).lower().replace("-", "")
        return [get_enum_from_job_type(job_type_str)] if job_type_str else None
    return None


def parse_company_industry(soup: BeautifulSoup) -> str | None:
    """
    Gets the company industry from the job page
    """
    industry_tag = soup.find("span", class_="industry")
    return industry_tag.get_text(strip=True) if industry_tag else None


def is_job_remote(title: str, description: str, location: Location) -> bool:
    """
    Searches the title, description, and location to check if the job is remote
    """
    remote_keywords = ["remote", "work from home", "wfh"]
    location_str = location.display_location()
    full_string = f"{title} {description} {location_str}".lower()
    return any(keyword in full_string for keyword in remote_keywords)

354
jobspy/util.py Normal file

@ -0,0 +1,354 @@
from __future__ import annotations
import logging
import re
from itertools import cycle
import numpy as np
import requests
import tls_client
import urllib3
from markdownify import markdownify as md
from requests.adapters import HTTPAdapter, Retry
from jobspy.model import CompensationInterval, JobType, Site
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


def create_logger(name: str):
    logger = logging.getLogger(f"JobSpy:{name}")
    logger.propagate = False
    if not logger.handlers:
        logger.setLevel(logging.INFO)
        console_handler = logging.StreamHandler()
        # Named log_format to avoid shadowing the built-in format()
        log_format = "%(asctime)s - %(levelname)s - %(name)s - %(message)s"
        formatter = logging.Formatter(log_format)
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)
    return logger


class RotatingProxySession:
    def __init__(self, proxies=None):
        if isinstance(proxies, str):
            self.proxy_cycle = cycle([self.format_proxy(proxies)])
        elif isinstance(proxies, list):
            self.proxy_cycle = (
                cycle([self.format_proxy(proxy) for proxy in proxies])
                if proxies
                else None
            )
        else:
            self.proxy_cycle = None

    @staticmethod
    def format_proxy(proxy):
        """Utility method to format a proxy string into a dictionary."""
        # Proxies that already carry a scheme (including SOCKS5) are used as-is
        if proxy.startswith(("http://", "https://", "socks5://")):
            return {"http": proxy, "https": proxy}
        return {"http": f"http://{proxy}", "https": f"http://{proxy}"}

class RequestsRotating(RotatingProxySession, requests.Session):
def __init__(self, proxies=None, has_retry=False, delay=1, clear_cookies=False):
RotatingProxySession.__init__(self, proxies=proxies)
requests.Session.__init__(self)
self.clear_cookies = clear_cookies
self.allow_redirects = True
self.setup_session(has_retry, delay)
def setup_session(self, has_retry, delay):
if has_retry:
retries = Retry(
total=3,
connect=3,
status=3,
status_forcelist=[500, 502, 503, 504, 429],
backoff_factor=delay,
)
adapter = HTTPAdapter(max_retries=retries)
self.mount("http://", adapter)
self.mount("https://", adapter)
def request(self, method, url, **kwargs):
if self.clear_cookies:
self.cookies.clear()
if self.proxy_cycle:
next_proxy = next(self.proxy_cycle)
if next_proxy["http"] != "http://localhost":
self.proxies = next_proxy
else:
self.proxies = {}
return requests.Session.request(self, method, url, **kwargs)
class TLSRotating(RotatingProxySession, tls_client.Session):
def __init__(self, proxies=None):
RotatingProxySession.__init__(self, proxies=proxies)
tls_client.Session.__init__(self, random_tls_extension_order=True)
def execute_request(self, *args, **kwargs):
if self.proxy_cycle:
next_proxy = next(self.proxy_cycle)
if next_proxy["http"] != "http://localhost":
self.proxies = next_proxy
else:
self.proxies = {}
response = tls_client.Session.execute_request(self, *args, **kwargs)
response.ok = response.status_code in range(200, 400)
return response


def create_session(
    *,
    proxies: list[str] | str | None = None,
    ca_cert: str | None = None,
    is_tls: bool = True,
    has_retry: bool = False,
    delay: int = 1,
    clear_cookies: bool = False,
) -> requests.Session:
    """
    Creates a requests session with optional tls, proxy, and retry settings.
    :param proxies: a single proxy string or a list of proxy strings to rotate through
    :return: A session object
    """
    if is_tls:
        session = TLSRotating(proxies=proxies)
    else:
        session = RequestsRotating(
            proxies=proxies,
            has_retry=has_retry,
            delay=delay,
            clear_cookies=clear_cookies,
        )
    if ca_cert:
        session.verify = ca_cert
    return session

def set_logger_level(verbose: int):
"""
Adjusts the logger's level. This function allows the logging level to be changed at runtime.
Parameters:
- verbose: int {0, 1, 2} (default=2, all logs)
"""
if verbose is None:
return
level_name = {2: "INFO", 1: "WARNING", 0: "ERROR"}.get(verbose, "INFO")
level = getattr(logging, level_name.upper(), None)
if level is not None:
for logger_name in logging.root.manager.loggerDict:
if logger_name.startswith("JobSpy:"):
logging.getLogger(logger_name).setLevel(level)
else:
raise ValueError(f"Invalid log level: {level_name}")
def markdown_converter(description_html: str):
if description_html is None:
return None
markdown = md(description_html)
return markdown.strip()
def extract_emails_from_text(text: str) -> list[str] | None:
if not text:
return None
email_regex = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
return email_regex.findall(text)
def get_enum_from_job_type(job_type_str: str) -> JobType | None:
"""
Given a string, returns the corresponding JobType enum member if a match is found.
"""
res = None
for job_type in JobType:
if job_type_str in job_type.value:
res = job_type
return res
def currency_parser(cur_str):
# Remove any non-numerical characters
# except for ',' '.' or '-' (e.g. EUR)
cur_str = re.sub("[^-0-9.,]", "", cur_str)
# Remove any 000s separators (either , or .)
cur_str = re.sub("[.,]", "", cur_str[:-3]) + cur_str[-3:]
if "." in list(cur_str[-3:]):
num = float(cur_str)
elif "," in list(cur_str[-3:]):
num = float(cur_str.replace(",", "."))
else:
num = float(cur_str)
return np.round(num, 2)
def remove_attributes(tag):
for attr in list(tag.attrs):
del tag[attr]
return tag


def extract_salary(
    salary_str,
    lower_limit=1000,
    upper_limit=700000,
    hourly_threshold=350,
    monthly_threshold=30000,
    enforce_annual_salary=False,
):
    """
    Extracts salary information from a string and returns the salary interval, min and max salary values, and currency.
    (TODO: Needs test cases as the regex is complicated and may not cover all edge cases)
    """
    if not salary_str:
        return None, None, None, None

    annual_max_salary = None
    min_max_pattern = r"\$(\d+(?:,\d+)?(?:\.\d+)?)([kK]?)\s*[-—–]\s*(?:\$)?(\d+(?:,\d+)?(?:\.\d+)?)([kK]?)"

    def to_int(s):
        return int(float(s.replace(",", "")))

    def convert_hourly_to_annual(hourly_wage):
        return hourly_wage * 2080

    def convert_monthly_to_annual(monthly_wage):
        return monthly_wage * 12

    match = re.search(min_max_pattern, salary_str)
    if match:
        min_salary = to_int(match.group(1))
        max_salary = to_int(match.group(3))
        # A 'k' suffix on either bound scales both bounds,
        # so "$50-60k" is read as 50,000-60,000
        if "k" in match.group(2).lower() or "k" in match.group(4).lower():
            min_salary *= 1000
            max_salary *= 1000

        # Classify the interval from the magnitude of the lower bound,
        # then derive the annual equivalents
        if min_salary < hourly_threshold:
            interval = CompensationInterval.HOURLY.value
            annual_min_salary = convert_hourly_to_annual(min_salary)
            if max_salary < hourly_threshold:
                annual_max_salary = convert_hourly_to_annual(max_salary)
        elif min_salary < monthly_threshold:
            interval = CompensationInterval.MONTHLY.value
            annual_min_salary = convert_monthly_to_annual(min_salary)
            if max_salary < monthly_threshold:
                annual_max_salary = convert_monthly_to_annual(max_salary)
        else:
            interval = CompensationInterval.YEARLY.value
            annual_min_salary = min_salary
            annual_max_salary = max_salary

        # Reject ranges whose bounds fall into different intervals
        if not annual_max_salary:
            return None, None, None, None

        # Ensure salary range is within specified limits
        if (
            lower_limit <= annual_min_salary <= upper_limit
            and lower_limit <= annual_max_salary <= upper_limit
            and annual_min_salary < annual_max_salary
        ):
            if enforce_annual_salary:
                return interval, annual_min_salary, annual_max_salary, "USD"
            else:
                return interval, min_salary, max_salary, "USD"
    return None, None, None, None

def extract_job_type(description: str):
if not description:
return []
keywords = {
JobType.FULL_TIME: r"full\s?time",
JobType.PART_TIME: r"part\s?time",
JobType.INTERNSHIP: r"internship",
JobType.CONTRACT: r"contract",
}
listing_types = []
for key, pattern in keywords.items():
if re.search(pattern, description, re.IGNORECASE):
listing_types.append(key)
return listing_types if listing_types else None
def map_str_to_site(site_name: str) -> Site:
return Site[site_name.upper()]
def get_enum_from_value(value_str):
for job_type in JobType:
if value_str in job_type.value:
return job_type
raise Exception(f"Invalid job type: {value_str}")
def convert_to_annual(job_data: dict):
if job_data["interval"] == "hourly":
job_data["min_amount"] *= 2080
job_data["max_amount"] *= 2080
if job_data["interval"] == "monthly":
job_data["min_amount"] *= 12
job_data["max_amount"] *= 12
if job_data["interval"] == "weekly":
job_data["min_amount"] *= 52
job_data["max_amount"] *= 52
if job_data["interval"] == "daily":
job_data["min_amount"] *= 260
job_data["max_amount"] *= 260
job_data["interval"] = "yearly"
desired_order = [
"id",
"site",
"job_url",
"job_url_direct",
"title",
"company",
"location",
"date_posted",
"job_type",
"salary_source",
"interval",
"min_amount",
"max_amount",
"currency",
"is_remote",
"job_level",
"job_function",
"listing_type",
"emails",
"description",
"company_industry",
"company_url",
"company_logo",
"company_url_direct",
"company_addresses",
"company_num_employees",
"company_revenue",
"company_description",
# naukri-specific fields
"skills",
"experience_range",
"company_rating",
"company_reviews_count",
"vacancy_count",
"work_from_home_type",
]
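
The proxy handling in `RotatingProxySession` comes down to two steps: normalize each proxy string into a `requests`-style mapping, then rotate through the results with `itertools.cycle`. Below is a minimal standalone sketch of that idea, assuming nothing beyond the standard library. The module-level `format_proxy` mirrors the static method, and the addresses are placeholders rather than real endpoints.

```python
from itertools import cycle


def format_proxy(proxy: str) -> dict:
    """Standalone copy of the normalization rule, for illustration only."""
    if proxy.startswith(("http://", "https://", "socks5://")):
        return {"http": proxy, "https": proxy}
    # Bare "host:port" strings are assumed to be plain HTTP proxies
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}


# Hypothetical proxy list used only to show the rotation order
proxy_pool = cycle([format_proxy(p) for p in ["10.0.0.1:8080", "socks5://10.0.0.2:1080"]])

first = next(proxy_pool)
second = next(proxy_pool)
third = next(proxy_pool)  # the cycle wraps around to the first proxy again

print(first)
print(second)
print(third == first)
```

Each request taking the next item from the cycle is what spreads traffic across the list. A SOCKS5 entry passes through unchanged, so it is presumably the HTTP client that needs SOCKS support installed.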


@ -1,49 +1,54 @@
"""
jobspy.scrapers.ziprecruiter
~~~~~~~~~~~~~~~~~~~
This module contains routines to scrape ZipRecruiter.
"""
from __future__ import annotations
import json
import math
import re
import time
from datetime import datetime
from typing import Optional, Tuple, Any
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from .. import Scraper, ScraperInput, Site
from ..utils import (
logger,
from bs4 import BeautifulSoup
from jobspy.ziprecruiter.constant import headers, get_cookie_data
from jobspy.util import (
extract_emails_from_text,
create_session,
markdown_converter,
remove_attributes,
create_logger,
)
from ...jobs import (
from jobspy.model import (
JobPost,
Compensation,
Location,
JobResponse,
JobType,
Country,
DescriptionFormat,
Scraper,
ScraperInput,
Site,
)
from jobspy.ziprecruiter.util import get_job_type_enum, add_params
log = create_logger("ZipRecruiter")
class ZipRecruiterScraper(Scraper):
class ZipRecruiter(Scraper):
base_url = "https://www.ziprecruiter.com"
api_url = "https://api.ziprecruiter.com"
def __init__(self, proxy: Optional[str] = None):
def __init__(
self, proxies: list[str] | str | None = None, ca_cert: str | None = None
):
"""
Initializes ZipRecruiterScraper with the ZipRecruiter job search url
"""
super().__init__(Site.ZIP_RECRUITER, proxies=proxies)
self.scraper_input = None
self.session = create_session(proxy)
self.session = create_session(proxies=proxies, ca_cert=ca_cert)
self.session.headers.update(headers)
self._get_cookies()
super().__init__(Site.ZIP_RECRUITER, proxy=proxy)
self.delay = 5
self.jobs_per_page = 20
@ -65,7 +70,7 @@ class ZipRecruiterScraper(Scraper):
break
if page > 1:
time.sleep(self.delay)
logger.info(f"ZipRecruiter search page: {page}")
log.info(f"search page: {page} / {max_pages}")
jobs_on_page, continue_token = self._find_jobs_in_page(
scraper_input, continue_token
)
@ -79,7 +84,7 @@ class ZipRecruiterScraper(Scraper):
def _find_jobs_in_page(
self, scraper_input: ScraperInput, continue_token: str | None = None
) -> Tuple[list[JobPost], Optional[str]]:
) -> tuple[list[JobPost], str | None]:
"""
Scrapes a page of ZipRecruiter for jobs with scraper_input criteria
:param scraper_input:
@ -87,26 +92,24 @@ class ZipRecruiterScraper(Scraper):
:return: jobs found on page
"""
jobs_list = []
params = self._add_params(scraper_input)
params = add_params(scraper_input)
if continue_token:
params["continue_from"] = continue_token
try:
res = self.session.get(
f"{self.api_url}/jobs-app/jobs", headers=self.headers, params=params
)
res = self.session.get(f"{self.api_url}/jobs-app/jobs", params=params)
if res.status_code not in range(200, 400):
if res.status_code == 429:
err = "429 Response - Blocked by ZipRecruiter for too many requests"
else:
err = f"ZipRecruiter response status code {res.status_code}"
err += f" with response: {res.text}" # ZipRecruiter likely not available in EU
logger.error(err)
log.error(err)
return jobs_list, ""
except Exception as e:
if "Proxy responded with" in str(e):
logger.error(f"Indeed: Bad proxy")
log.error("ZipRecruiter: Bad proxy")
else:
logger.error(f"Indeed: {str(e)}")
log.error(f"ZipRecruiter: {str(e)}")
return jobs_list, ""
res_data = res.json()
@ -129,6 +132,7 @@ class ZipRecruiterScraper(Scraper):
self.seen_urls.add(job_url)
description = job.get("job_description", "").strip()
listing_type = job.get("buyer_type", "")
description = (
markdown_converter(description)
if self.scraper_input.description_format == DescriptionFormat.MARKDOWN
@ -141,7 +145,7 @@ class ZipRecruiterScraper(Scraper):
location = Location(
city=job.get("job_city"), state=job.get("job_state"), country=country_enum
)
job_type = self._get_job_type_enum(
job_type = get_job_type_enum(
job.get("employment_type", "").replace("_", "").lower()
)
date_posted = datetime.fromisoformat(job["posted_time"].rstrip("Z")).date()
@ -150,8 +154,10 @@ class ZipRecruiterScraper(Scraper):
comp_min = int(job["compensation_min"]) if "compensation_min" in job else None
comp_max = int(job["compensation_max"]) if "compensation_max" in job else None
comp_currency = job.get("compensation_currency")
description_full, job_url_direct = self._get_descr(job_url)
return JobPost(
id=str(job['listing_key']),
id=f'zr-{job["listing_key"]}',
title=title,
company_name=company,
location=location,
@ -164,49 +170,50 @@ class ZipRecruiterScraper(Scraper):
),
date_posted=date_posted,
job_url=job_url,
description=description,
description=description_full if description_full else description,
emails=extract_emails_from_text(description) if description else None,
job_url_direct=job_url_direct,
listing_type=listing_type,
)
def _get_descr(self, job_url):
res = self.session.get(job_url, allow_redirects=True)
description_full = job_url_direct = None
if res.ok:
soup = BeautifulSoup(res.text, "html.parser")
job_descr_div = soup.find("div", class_="job_description")
company_descr_section = soup.find("section", class_="company_description")
job_description_clean = (
remove_attributes(job_descr_div).prettify(formatter="html")
if job_descr_div
else ""
)
company_description_clean = (
remove_attributes(company_descr_section).prettify(formatter="html")
if company_descr_section
else ""
)
description_full = job_description_clean + company_description_clean
try:
script_tag = soup.find("script", type="application/json")
if script_tag:
job_json = json.loads(script_tag.string)
job_url_val = job_json["model"].get("saveJobURL", "")
m = re.search(r"job_url=(.+)", job_url_val)
if m:
job_url_direct = m.group(1)
except Exception:
job_url_direct = None
if self.scraper_input.description_format == DescriptionFormat.MARKDOWN:
description_full = markdown_converter(description_full)
return description_full, job_url_direct
def _get_cookies(self):
data = "event_type=session&logged_in=false&number_of_retry=1&property=model%3AiPhone&property=os%3AiOS&property=locale%3Aen_us&property=app_build_number%3A4734&property=app_version%3A91.0&property=manufacturer%3AApple&property=timestamp%3A2024-01-12T12%3A04%3A42-06%3A00&property=screen_height%3A852&property=os_version%3A16.6.1&property=source%3Ainstall&property=screen_width%3A393&property=device_model%3AiPhone%2014%20Pro&property=brand%3AApple"
"""
Sends a session event to the API with device properties.
"""
url = f"{self.api_url}/jobs-app/event"
self.session.post(url, data=data, headers=self.headers)
@staticmethod
def _get_job_type_enum(job_type_str: str) -> list[JobType] | None:
for job_type in JobType:
if job_type_str in job_type.value:
return [job_type]
return None
@staticmethod
def _add_params(scraper_input) -> dict[str, str | Any]:
params = {
"search": scraper_input.search_term,
"location": scraper_input.location,
}
if scraper_input.hours_old:
params["days"] = max(scraper_input.hours_old // 24, 1)
job_type_map = {JobType.FULL_TIME: "full_time", JobType.PART_TIME: "part_time"}
if scraper_input.job_type:
job_type = scraper_input.job_type
params["employment_type"] = job_type_map.get(job_type, job_type.value[0])
if scraper_input.easy_apply:
params["zipapply"] = 1
if scraper_input.is_remote:
params["remote"] = 1
if scraper_input.distance:
params["radius"] = scraper_input.distance
return {k: v for k, v in params.items() if v is not None}
headers = {
"Host": "api.ziprecruiter.com",
"accept": "*/*",
"x-zr-zva-override": "100000000;vid:ZT1huzm_EQlDTVEc",
"x-pushnotificationid": "0ff4983d38d7fc5b3370297f2bcffcf4b3321c418f5c22dd152a0264707602a0",
"x-deviceid": "D77B3A92-E589-46A4-8A39-6EF6F1D86006",
"user-agent": "Job Search/87.0 (iPhone; CPU iOS 16_6_1 like Mac OS X)",
"authorization": "Basic YTBlZjMyZDYtN2I0Yy00MWVkLWEyODMtYTI1NDAzMzI0YTcyOg==",
"accept-language": "en-US,en;q=0.9",
}
self.session.post(url, data=get_cookie_data)


@ -0,0 +1,29 @@
headers = {
"Host": "api.ziprecruiter.com",
"accept": "*/*",
"x-zr-zva-override": "100000000;vid:ZT1huzm_EQlDTVEc",
"x-pushnotificationid": "0ff4983d38d7fc5b3370297f2bcffcf4b3321c418f5c22dd152a0264707602a0",
"x-deviceid": "D77B3A92-E589-46A4-8A39-6EF6F1D86006",
"user-agent": "Job Search/87.0 (iPhone; CPU iOS 16_6_1 like Mac OS X)",
"authorization": "Basic YTBlZjMyZDYtN2I0Yy00MWVkLWEyODMtYTI1NDAzMzI0YTcyOg==",
"accept-language": "en-US,en;q=0.9",
}
get_cookie_data = [
("event_type", "session"),
("logged_in", "false"),
("number_of_retry", "1"),
("property", "model:iPhone"),
("property", "os:iOS"),
("property", "locale:en_us"),
("property", "app_build_number:4734"),
("property", "app_version:91.0"),
("property", "manufacturer:Apple"),
("property", "timestamp:2025-01-12T12:04:42-06:00"),
("property", "screen_height:852"),
("property", "os_version:16.6.1"),
("property", "source:install"),
("property", "screen_width:393"),
("property", "device_model:iPhone 14 Pro"),
("property", "brand:Apple"),
]


@ -0,0 +1,31 @@
from jobspy.model import JobType
def add_params(scraper_input) -> dict[str, str | int]:
params: dict[str, str | int] = {
"search": scraper_input.search_term,
"location": scraper_input.location,
}
if scraper_input.hours_old:
params["days"] = max(scraper_input.hours_old // 24, 1)
job_type_map = {JobType.FULL_TIME: "full_time", JobType.PART_TIME: "part_time"}
if scraper_input.job_type:
job_type = scraper_input.job_type
params["employment_type"] = job_type_map.get(job_type, job_type.value[0])
if scraper_input.easy_apply:
params["zipapply"] = 1
if scraper_input.is_remote:
params["remote"] = 1
if scraper_input.distance:
params["radius"] = scraper_input.distance
return {k: v for k, v in params.items() if v is not None}
def get_job_type_enum(job_type_str: str) -> list[JobType] | None:
for job_type in JobType:
if job_type_str in job_type.value:
return [job_type]
return None
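
Two details of `add_params` are easy to miss: `hours_old` is floored to whole days with a minimum of one, and any parameter whose value is `None` is dropped before the request is made. The sketch below isolates both behaviours. `build_params` is a hypothetical, simplified stand-in that takes plain arguments instead of a `ScraperInput`.

```python
def build_params(search_term, location, hours_old=None, is_remote=False):
    """Simplified stand-in for add_params, for illustration only."""
    params = {"search": search_term, "location": location}
    if hours_old:
        # Floor to whole days, but never send less than one day
        params["days"] = max(hours_old // 24, 1)
    if is_remote:
        params["remote"] = 1
    # Unset values are removed rather than sent as empty parameters
    return {k: v for k, v in params.items() if v is not None}


print(build_params("python", None, hours_old=5))
print(build_params("python", "Austin, TX", hours_old=72, is_remote=True))
```

A search limited to the last 5 hours is therefore widened to one full day, since the `days` value sent here is a whole number of days.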

2181
poetry.lock generated

File diff suppressed because it is too large


@ -1,36 +1,33 @@
[build-system]
requires = [ "poetry-core",]
build-backend = "poetry.core.masonry.api"
[tool.poetry]
name = "python-jobspy"
version = "1.1.53"
description = "Job scraper for LinkedIn, Indeed, Glassdoor & ZipRecruiter"
authors = ["Zachary Hampton <zachary@bunsly.com>", "Cullen Watson <cullen@bunsly.com>"]
homepage = "https://github.com/Bunsly/JobSpy"
version = "1.1.80"
description = "Job scraper for LinkedIn, Indeed, Glassdoor, ZipRecruiter & Bayt"
authors = ["Cullen Watson <cullen@cullenwatson.com>", "Zachary Hampton <zachary@zacharysproducts.com>"]
homepage = "https://github.com/cullenwatson/JobSpy"
readme = "README.md"
keywords = [ "jobs-scraper", "linkedin", "indeed", "glassdoor", "ziprecruiter", "bayt", "naukri"]
[[tool.poetry.packages]]
include = "jobspy"
packages = [
{ include = "jobspy", from = "src" }
]
[tool.black]
line-length = 88
[tool.poetry.dependencies]
python = "^3.10"
requests = "^2.31.0"
beautifulsoup4 = "^4.12.2"
pandas = "^2.1.0"
NUMPY = "1.24.2"
NUMPY = "1.26.3"
pydantic = "^2.3.0"
tls-client = "^1.0.1"
markdownify = "^0.11.6"
markdownify = "^0.13.1"
regex = "^2024.4.28"
[tool.poetry.group.dev.dependencies]
pytest = "^7.4.1"
jupyter = "^1.0.0"
black = "*"
pre-commit = "*"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
[tool.black]
line-length = 88


@ -1,47 +0,0 @@
from __future__ import annotations
from abc import ABC, abstractmethod
from ..jobs import (
Enum,
BaseModel,
JobType,
JobResponse,
Country,
DescriptionFormat,
)
class Site(Enum):
LINKEDIN = "linkedin"
INDEED = "indeed"
ZIP_RECRUITER = "zip_recruiter"
GLASSDOOR = "glassdoor"
class ScraperInput(BaseModel):
site_type: list[Site]
search_term: str | None = None
location: str | None = None
country: Country | None = Country.USA
distance: int | None = None
is_remote: bool = False
job_type: JobType | None = None
easy_apply: bool | None = None
offset: int = 0
linkedin_fetch_description: bool = False
linkedin_company_ids: list[int] | None = None
description_format: DescriptionFormat | None = DescriptionFormat.MARKDOWN
results_wanted: int = 15
hours_old: int | None = None
class Scraper(ABC):
def __init__(self, site: Site, proxy: list[str] | None = None):
self.site = site
self.proxy = (lambda p: {"http": p, "https": p} if p else None)(proxy)
@abstractmethod
def scrape(self, scraper_input: ScraperInput) -> JobResponse: ...


@ -1,535 +0,0 @@
"""
jobspy.scrapers.glassdoor
~~~~~~~~~~~~~~~~~~~
This module contains routines to scrape Glassdoor.
"""
from __future__ import annotations
import re
import json
import requests
from typing import Optional, Tuple
from datetime import datetime, timedelta
from concurrent.futures import ThreadPoolExecutor, as_completed
from .. import Scraper, ScraperInput, Site
from ..utils import extract_emails_from_text
from ..exceptions import GlassdoorException
from ..utils import (
create_session,
markdown_converter,
logger,
)
from ...jobs import (
JobPost,
Compensation,
CompensationInterval,
Location,
JobResponse,
JobType,
DescriptionFormat,
)
class GlassdoorScraper(Scraper):
def __init__(self, proxy: Optional[str] = None):
"""
Initializes GlassdoorScraper with the Glassdoor job search url
"""
site = Site(Site.GLASSDOOR)
super().__init__(site, proxy=proxy)
self.base_url = None
self.country = None
self.session = None
self.scraper_input = None
self.jobs_per_page = 30
self.max_pages = 30
self.seen_urls = set()
def scrape(self, scraper_input: ScraperInput) -> JobResponse:
"""
Scrapes Glassdoor for jobs with scraper_input criteria.
:param scraper_input: Information about job search criteria.
:return: JobResponse containing a list of jobs.
"""
self.scraper_input = scraper_input
self.scraper_input.results_wanted = min(900, scraper_input.results_wanted)
self.base_url = self.scraper_input.country.get_glassdoor_url()
self.session = create_session(self.proxy, is_tls=True, has_retry=True)
token = self._get_csrf_token()
self.headers["gd-csrf-token"] = token if token else self.fallback_token
location_id, location_type = self._get_location(
scraper_input.location, scraper_input.is_remote
)
if location_type is None:
logger.error("Glassdoor: location not parsed")
return JobResponse(jobs=[])
all_jobs: list[JobPost] = []
cursor = None
range_start = 1 + (scraper_input.offset // self.jobs_per_page)
tot_pages = (scraper_input.results_wanted // self.jobs_per_page) + 2
range_end = min(tot_pages, self.max_pages + 1)
for page in range(range_start, range_end):
logger.info(f"Glassdoor search page: {page}")
try:
jobs, cursor = self._fetch_jobs_page(
scraper_input, location_id, location_type, page, cursor
)
all_jobs.extend(jobs)
if not jobs or len(all_jobs) >= scraper_input.results_wanted:
all_jobs = all_jobs[: scraper_input.results_wanted]
break
except Exception as e:
logger.error(f"Glassdoor: {str(e)}")
break
return JobResponse(jobs=all_jobs)
def _fetch_jobs_page(
self,
scraper_input: ScraperInput,
location_id: int,
location_type: str,
page_num: int,
cursor: str | None,
) -> Tuple[list[JobPost], str | None]:
"""
Scrapes a page of Glassdoor for jobs with scraper_input criteria
"""
jobs = []
self.scraper_input = scraper_input
try:
payload = self._add_payload(location_id, location_type, page_num, cursor)
response = self.session.post(
f"{self.base_url}/graph",
headers=self.headers,
timeout_seconds=15,
data=payload,
)
if response.status_code != 200:
exc_msg = f"bad response status code: {response.status_code}"
raise GlassdoorException(exc_msg)
res_json = response.json()[0]
if "errors" in res_json:
raise ValueError("Error encountered in API response")
except (
requests.exceptions.ReadTimeout,
GlassdoorException,
ValueError,
Exception,
) as e:
logger.error(f"Glassdoor: {str(e)}")
return jobs, None
jobs_data = res_json["data"]["jobListings"]["jobListings"]
with ThreadPoolExecutor(max_workers=self.jobs_per_page) as executor:
future_to_job_data = {
executor.submit(self._process_job, job): job for job in jobs_data
}
for future in as_completed(future_to_job_data):
try:
job_post = future.result()
if job_post:
jobs.append(job_post)
except Exception as exc:
raise GlassdoorException(f"Glassdoor generated an exception: {exc}")
return jobs, self.get_cursor_for_page(
res_json["data"]["jobListings"]["paginationCursors"], page_num + 1
)
def _get_csrf_token(self):
"""
Fetches csrf token needed for API by visiting a generic page
"""
res = self.session.get(
f"{self.base_url}/Job/computer-science-jobs.htm", headers=self.headers
)
pattern = r'"token":\s*"([^"]+)"'
matches = re.findall(pattern, res.text)
token = None
if matches:
token = matches[0]
return token

    def _process_job(self, job_data):
        """
        Processes a single job and fetches its description.
        """
        job_id = job_data["jobview"]["job"]["listingId"]
        job_url = f"{self.base_url}job-listing/j?jl={job_id}"
        if job_url in self.seen_urls:
            return None
        self.seen_urls.add(job_url)
        job = job_data["jobview"]
        title = job["job"]["jobTitleText"]
        company_name = job["header"]["employerNameFromSearch"]
        company_id = job_data["jobview"]["header"]["employer"]["id"]
        location_name = job["header"].get("locationName", "")
        location_type = job["header"].get("locationType", "")
        age_in_days = job["header"].get("ageInDays")
        is_remote, location = False, None
        # Only compute the posting date when the listing reports its age;
        # timedelta(days=None) would raise a TypeError
        date_posted = (
            (datetime.now() - timedelta(days=age_in_days)).date()
            if age_in_days is not None
            else None
        )
        if location_type == "S":
            is_remote = True
        else:
            location = self.parse_location(location_name)

        compensation = self.parse_compensation(job["header"])
        try:
            description = self._fetch_job_description(job_id)
        except Exception:
            description = None
        company_url = f"{self.base_url}Overview/W-EI_IE{company_id}.htm"
        return JobPost(
            id=str(job_id),
            title=title,
            company_url=company_url if company_id else None,
            company_name=company_name,
            date_posted=date_posted,
            job_url=job_url,
            location=location,
            compensation=compensation,
            is_remote=is_remote,
            description=description,
            emails=extract_emails_from_text(description) if description else None,
        )
def _fetch_job_description(self, job_id):
"""
Fetches the job description for a single job ID.
"""
url = f"{self.base_url}/graph"
body = [
{
"operationName": "JobDetailQuery",
"variables": {
"jl": job_id,
"queryString": "q",
"pageTypeEnum": "SERP",
},
"query": """
query JobDetailQuery($jl: Long!, $queryString: String, $pageTypeEnum: PageTypeEnum) {
jobview: jobView(
listingId: $jl
contextHolder: {queryString: $queryString, pageTypeEnum: $pageTypeEnum}
) {
job {
description
__typename
}
__typename
}
}
""",
}
]
res = requests.post(url, json=body, headers=self.headers)
if res.status_code != 200:
return None
data = res.json()[0]
desc = data["data"]["jobview"]["job"]["description"]
if self.scraper_input.description_format == DescriptionFormat.MARKDOWN:
desc = markdown_converter(desc)
return desc

    def _get_location(
        self, location: str, is_remote: bool
    ) -> tuple[int | None, str | None]:
        if not location or is_remote:
            return 11047, "STATE"  # remote options
        url = f"{self.base_url}/findPopularLocationAjax.htm?maxLocationsToReturn=10&term={location}"
        res = self.session.get(url, headers=self.headers)
        if res.status_code != 200:
            if res.status_code == 429:
                err = "429 Response - Blocked by Glassdoor for too many requests"
            else:
                err = f"Glassdoor response status code {res.status_code} - {res.text}"
            logger.error(err)
            return None, None
        items = res.json()
        if not items:
            raise ValueError(f"Location '{location}' not found on Glassdoor")
        location_type = items[0]["locationType"]
        if location_type == "C":
            location_type = "CITY"
        elif location_type == "S":
            location_type = "STATE"
        elif location_type == "N":
            location_type = "COUNTRY"
        return int(items[0]["locationId"]), location_type
def _add_payload(
self,
location_id: int,
location_type: str,
page_num: int,
cursor: str | None = None,
) -> str:
fromage = None
if self.scraper_input.hours_old:
fromage = max(self.scraper_input.hours_old // 24, 1)
filter_params = []
if self.scraper_input.easy_apply:
filter_params.append({"filterKey": "applicationType", "values": "1"})
if fromage:
filter_params.append({"filterKey": "fromAge", "values": str(fromage)})
payload = {
"operationName": "JobSearchResultsQuery",
"variables": {
"excludeJobListingIds": [],
"filterParams": filter_params,
"keyword": self.scraper_input.search_term,
"numJobsToShow": 30,
"locationType": location_type,
"locationId": int(location_id),
"parameterUrlInput": f"IL.0,12_I{location_type}{location_id}",
"pageNumber": page_num,
"pageCursor": cursor,
"fromage": fromage,
"sort": "date",
},
"query": self.query_template,
}
if self.scraper_input.job_type:
payload["variables"]["filterParams"].append(
{"filterKey": "jobType", "values": self.scraper_input.job_type.value[0]}
)
return json.dumps([payload])
    @staticmethod
    def parse_compensation(data: dict) -> Optional[Compensation]:
        pay_period = data.get("payPeriod")
        adjusted_pay = data.get("payPeriodAdjustedPay")
        currency = data.get("payCurrency", "USD")
        if not pay_period or not adjusted_pay:
            return None
        if pay_period == "ANNUAL":
            interval = CompensationInterval.YEARLY
        else:
            interval = CompensationInterval.get_interval(pay_period)
        # p10/p90 may be absent; keep the bound as None instead of raising TypeError
        p10 = adjusted_pay.get("p10")
        p90 = adjusted_pay.get("p90")
        return Compensation(
            interval=interval,
            min_amount=int(p10 // 1) if p10 is not None else None,
            max_amount=int(p90 // 1) if p90 is not None else None,
            currency=currency,
        )
@staticmethod
def get_job_type_enum(job_type_str: str) -> list[JobType] | None:
for job_type in JobType:
if job_type_str in job_type.value:
return [job_type]
@staticmethod
def parse_location(location_name: str) -> Location | None:
if not location_name or location_name == "Remote":
return
city, _, state = location_name.partition(", ")
return Location(city=city, state=state)
@staticmethod
def get_cursor_for_page(pagination_cursors, page_num):
for cursor_data in pagination_cursors:
if cursor_data["pageNumber"] == page_num:
return cursor_data["cursor"]
fallback_token = "Ft6oHEWlRZrxDww95Cpazw:0pGUrkb2y3TyOpAIqF2vbPmUXoXVkD3oEGDVkvfeCerceQ5-n8mBg3BovySUIjmCPHCaW0H2nQVdqzbtsYqf4Q:wcqRqeegRUa9MVLJGyujVXB7vWFPjdaS1CtrrzJq-ok"
headers = {
"authority": "www.glassdoor.com",
"accept": "*/*",
"accept-language": "en-US,en;q=0.9",
"apollographql-client-name": "job-search-next",
"apollographql-client-version": "4.65.5",
"content-type": "application/json",
"origin": "https://www.glassdoor.com",
"referer": "https://www.glassdoor.com/",
"sec-ch-ua": '"Chromium";v="118", "Google Chrome";v="118", "Not=A?Brand";v="99"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"macOS"',
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
}
query_template = """
query JobSearchResultsQuery(
$excludeJobListingIds: [Long!],
$keyword: String,
$locationId: Int,
$locationType: LocationTypeEnum,
$numJobsToShow: Int!,
$pageCursor: String,
$pageNumber: Int,
$filterParams: [FilterParams],
$originalPageUrl: String,
$seoFriendlyUrlInput: String,
$parameterUrlInput: String,
$seoUrl: Boolean
) {
jobListings(
contextHolder: {
searchParams: {
excludeJobListingIds: $excludeJobListingIds,
keyword: $keyword,
locationId: $locationId,
locationType: $locationType,
numPerPage: $numJobsToShow,
pageCursor: $pageCursor,
pageNumber: $pageNumber,
filterParams: $filterParams,
originalPageUrl: $originalPageUrl,
seoFriendlyUrlInput: $seoFriendlyUrlInput,
parameterUrlInput: $parameterUrlInput,
seoUrl: $seoUrl,
searchType: SR
}
}
) {
companyFilterOptions {
id
shortName
__typename
}
filterOptions
indeedCtk
jobListings {
...JobView
__typename
}
jobListingSeoLinks {
linkItems {
position
url
__typename
}
__typename
}
jobSearchTrackingKey
jobsPageSeoData {
pageMetaDescription
pageTitle
__typename
}
paginationCursors {
cursor
pageNumber
__typename
}
indexablePageForSeo
searchResultsMetadata {
searchCriteria {
implicitLocation {
id
localizedDisplayName
type
__typename
}
keyword
location {
id
shortName
localizedShortName
localizedDisplayName
type
__typename
}
__typename
}
helpCenterDomain
helpCenterLocale
jobSerpJobOutlook {
occupation
paragraph
__typename
}
showMachineReadableJobs
__typename
}
totalJobsCount
__typename
}
}
fragment JobView on JobListingSearchResult {
jobview {
header {
adOrderId
advertiserType
adOrderSponsorshipLevel
ageInDays
divisionEmployerName
easyApply
employer {
id
name
shortName
__typename
}
employerNameFromSearch
goc
gocConfidence
gocId
jobCountryId
jobLink
jobResultTrackingKey
jobTitleText
locationName
locationType
locId
needsCommission
payCurrency
payPeriod
payPeriodAdjustedPay {
p10
p50
p90
__typename
}
rating
salarySource
savedJobId
sponsored
__typename
}
job {
description
importConfigId
jobTitleId
jobTitleText
listingId
__typename
}
jobListingAdminDetails {
cpcVal
importConfigId
jobListingId
jobSourceId
userEligibleForAdminJobDetails
__typename
}
overview {
shortName
squareLogoUrl
__typename
}
__typename
}
__typename
}
"""
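
The two lookups in the Glassdoor scraper above (location-type codes and pagination cursors) can be sketched in isolation. This is a minimal illustration and not the scraper's own code: the helper names `normalize_location_type` and `cursor_for_page` are hypothetical, and the sample cursor values are made up.

```python
from typing import Optional

# One-letter codes returned by the location lookup, mapped to the names used in the search payload.
LOCATION_TYPES = {"C": "CITY", "S": "STATE", "N": "COUNTRY"}


def normalize_location_type(code: str) -> str:
    """Expand a one-letter location code; unknown codes pass through unchanged."""
    return LOCATION_TYPES.get(code, code)


def cursor_for_page(pagination_cursors: list, page_num: int) -> Optional[str]:
    """Return the cursor whose pageNumber matches page_num, or None if there is none."""
    for cursor_data in pagination_cursors:
        if cursor_data["pageNumber"] == page_num:
            return cursor_data["cursor"]
    return None


# Sample data in the shape of the paginationCursors field (values are invented).
cursors = [
    {"pageNumber": 1, "cursor": "cursor-a"},
    {"pageNumber": 2, "cursor": "cursor-b"},
]
print(normalize_location_type("C"))
print(cursor_for_page(cursors, 2))
print(cursor_for_page(cursors, 5))
```

Returning `None` for a missing page lets the caller stop paginating without catching an exception.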


@ -1,435 +0,0 @@
"""
jobspy.scrapers.indeed
~~~~~~~~~~~~~~~~~~~
This module contains routines to scrape Indeed.
"""
from __future__ import annotations
import math
from typing import Tuple
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, Future
import requests
from .. import Scraper, ScraperInput, Site
from ..utils import (
extract_emails_from_text,
get_enum_from_job_type,
markdown_converter,
logger,
)
from ...jobs import (
JobPost,
Compensation,
CompensationInterval,
Location,
JobResponse,
JobType,
DescriptionFormat,
)
class IndeedScraper(Scraper):
def __init__(self, proxy: str | None = None):
"""
Initializes IndeedScraper with the Indeed API url
"""
self.scraper_input = None
self.jobs_per_page = 100
self.num_workers = 10
self.seen_urls = set()
self.headers = None
self.api_country_code = None
self.base_url = None
self.api_url = "https://apis.indeed.com/graphql"
site = Site(Site.INDEED)
super().__init__(site, proxy=proxy)
    def scrape(self, scraper_input: ScraperInput) -> JobResponse:
        """
        Scrapes Indeed for jobs with scraper_input criteria
        :param scraper_input:
        :return: job_response
        """
        self.scraper_input = scraper_input
        domain, self.api_country_code = self.scraper_input.country.indeed_domain_value
        self.base_url = f"https://{domain}.indeed.com"
        self.headers = self.api_headers.copy()
        # indeed_domain_value is a (domain, country code) pair; the header takes only the code
        self.headers["indeed-co"] = self.api_country_code
        job_list = []
        page = 1
        cursor = None
        offset_pages = math.ceil(self.scraper_input.offset / self.jobs_per_page)
        for _ in range(offset_pages):
            logger.info(f"Indeed skipping search page: {page}")
            skipped_jobs, cursor = self._scrape_page(cursor)
            if not skipped_jobs:
                logger.info(f"Indeed found no jobs on page: {page}")
                break
            page += 1
        while len(self.seen_urls) < scraper_input.results_wanted:
            logger.info(f"Indeed search page: {page}")
            jobs, cursor = self._scrape_page(cursor)
            if not jobs:
                logger.info(f"Indeed found no jobs on page: {page}")
                break
            job_list += jobs
            page += 1
        return JobResponse(jobs=job_list[: scraper_input.results_wanted])
def _scrape_page(self, cursor: str | None) -> Tuple[list[JobPost], str | None]:
"""
Scrapes a page of Indeed for jobs with scraper_input criteria
:param cursor:
:return: jobs found on page, next page cursor
"""
jobs = []
new_cursor = None
filters = self._build_filters()
search_term = self.scraper_input.search_term.replace('"', '\\"') if self.scraper_input.search_term else ""
query = self.job_search_query.format(
what=(
f'what: "{search_term}"'
if search_term
else ""
),
location=(
f'location: {{where: "{self.scraper_input.location}", radius: {self.scraper_input.distance}, radiusUnit: MILES}}'
if self.scraper_input.location
else ""
),
dateOnIndeed=self.scraper_input.hours_old,
cursor=f'cursor: "{cursor}"' if cursor else "",
filters=filters,
)
payload = {
"query": query,
}
api_headers = self.api_headers.copy()
api_headers["indeed-co"] = self.api_country_code
response = requests.post(
self.api_url,
headers=api_headers,
json=payload,
proxies=self.proxy,
timeout=10,
)
if response.status_code != 200:
logger.info(
f"Indeed responded with status code: {response.status_code} (submit GitHub issue if this appears to be a bug)"
)
return jobs, new_cursor
data = response.json()
jobs = data["data"]["jobSearch"]["results"]
new_cursor = data["data"]["jobSearch"]["pageInfo"]["nextCursor"]
with ThreadPoolExecutor(max_workers=self.num_workers) as executor:
job_results: list[Future] = [
executor.submit(self._process_job, job["job"]) for job in jobs
]
job_list = [result.result() for result in job_results if result.result()]
return job_list, new_cursor
def _build_filters(self):
"""
Builds the filters dict for job type/is_remote. If hours_old is provided, composite filter for job_type/is_remote is not possible.
IndeedApply: filters: { keyword: { field: "indeedApplyScope", keys: ["DESKTOP"] } }
"""
filters_str = ""
if self.scraper_input.hours_old:
filters_str = """
filters: {{
date: {{
field: "dateOnIndeed",
start: "{start}h"
}}
}}
""".format(
start=self.scraper_input.hours_old
)
elif self.scraper_input.easy_apply:
filters_str = """
filters: {
keyword: {
field: "indeedApplyScope",
keys: ["DESKTOP"]
}
}
"""
elif self.scraper_input.job_type or self.scraper_input.is_remote:
job_type_key_mapping = {
JobType.FULL_TIME: "CF3CP",
JobType.PART_TIME: "75GKK",
JobType.CONTRACT: "NJXCK",
JobType.INTERNSHIP: "VDTG7",
}
keys = []
if self.scraper_input.job_type:
key = job_type_key_mapping[self.scraper_input.job_type]
keys.append(key)
if self.scraper_input.is_remote:
keys.append("DSQF7")
if keys:
keys_str = '", "'.join(keys) # Prepare your keys string
filters_str = f"""
filters: {{
composite: {{
filters: [{{
keyword: {{
field: "attributes",
keys: ["{keys_str}"]
}}
}}]
}}
}}
"""
return filters_str
def _process_job(self, job: dict) -> JobPost | None:
"""
Parses the job dict into JobPost model
:param job: dict to parse
:return: JobPost if it's a new job
"""
job_url = f'{self.base_url}/viewjob?jk={job["key"]}'
if job_url in self.seen_urls:
return
self.seen_urls.add(job_url)
description = job["description"]["html"]
if self.scraper_input.description_format == DescriptionFormat.MARKDOWN:
description = markdown_converter(description)
job_type = self._get_job_type(job["attributes"])
timestamp_seconds = job["datePublished"] / 1000
date_posted = datetime.fromtimestamp(timestamp_seconds).strftime("%Y-%m-%d")
employer = job["employer"].get("dossier") if job["employer"] else None
employer_details = employer.get("employerDetails", {}) if employer else {}
rel_url = job["employer"]["relativeCompanyPageUrl"] if job["employer"] else None
return JobPost(
id=str(job["key"]),
title=job["title"],
description=description,
company_name=job["employer"].get("name") if job.get("employer") else None,
company_url=(f"{self.base_url}{rel_url}" if job["employer"] else None),
company_url_direct=(
employer["links"]["corporateWebsite"] if employer else None
),
location=Location(
city=job.get("location", {}).get("city"),
state=job.get("location", {}).get("admin1Code"),
country=job.get("location", {}).get("countryCode"),
),
job_type=job_type,
compensation=self._get_compensation(job),
date_posted=date_posted,
job_url=job_url,
job_url_direct=(
job["recruit"].get("viewJobUrl") if job.get("recruit") else None
),
emails=extract_emails_from_text(description) if description else None,
is_remote=self._is_job_remote(job, description),
company_addresses=(
employer_details["addresses"][0]
if employer_details.get("addresses")
else None
),
company_industry=(
employer_details["industry"]
.replace("Iv1", "")
.replace("_", " ")
.title()
if employer_details.get("industry")
else None
),
company_num_employees=employer_details.get("employeesLocalizedLabel"),
company_revenue=employer_details.get("revenueLocalizedLabel"),
company_description=employer_details.get("briefDescription"),
ceo_name=employer_details.get("ceoName"),
ceo_photo_url=employer_details.get("ceoPhotoUrl"),
logo_photo_url=(
employer["images"].get("squareLogoUrl")
if employer and employer.get("images")
else None
),
banner_photo_url=(
employer["images"].get("headerImageUrl")
if employer and employer.get("images")
else None
),
)
@staticmethod
def _get_job_type(attributes: list) -> list[JobType]:
"""
Parses the attributes to get list of job types
:param attributes:
:return: list of JobType
"""
job_types: list[JobType] = []
for attribute in attributes:
job_type_str = attribute["label"].replace("-", "").replace(" ", "").lower()
job_type = get_enum_from_job_type(job_type_str)
if job_type:
job_types.append(job_type)
return job_types
@staticmethod
def _get_compensation(job: dict) -> Compensation | None:
"""
Parses the job to get compensation
:param job:
:return: compensation object
"""
comp = job["compensation"]["baseSalary"]
if not comp:
return None
interval = IndeedScraper._get_compensation_interval(comp["unitOfWork"])
if not interval:
return None
min_range = comp["range"].get("min")
max_range = comp["range"].get("max")
return Compensation(
interval=interval,
min_amount=round(min_range, 2) if min_range is not None else None,
max_amount=round(max_range, 2) if max_range is not None else None,
currency=job["compensation"]["currencyCode"],
)
@staticmethod
def _is_job_remote(job: dict, description: str) -> bool:
"""
Searches the description, location, and attributes to check if job is remote
"""
remote_keywords = ["remote", "work from home", "wfh"]
is_remote_in_attributes = any(
any(keyword in attr["label"].lower() for keyword in remote_keywords)
for attr in job["attributes"]
)
is_remote_in_description = any(
keyword in description.lower() for keyword in remote_keywords
)
is_remote_in_location = any(
keyword in job["location"]["formatted"]["long"].lower()
for keyword in remote_keywords
)
return (
is_remote_in_attributes or is_remote_in_description or is_remote_in_location
)
@staticmethod
def _get_compensation_interval(interval: str) -> CompensationInterval:
interval_mapping = {
"DAY": "DAILY",
"YEAR": "YEARLY",
"HOUR": "HOURLY",
"WEEK": "WEEKLY",
"MONTH": "MONTHLY",
}
mapped_interval = interval_mapping.get(interval.upper(), None)
if mapped_interval and mapped_interval in CompensationInterval.__members__:
return CompensationInterval[mapped_interval]
else:
raise ValueError(f"Unsupported interval: {interval}")
api_headers = {
"Host": "apis.indeed.com",
"content-type": "application/json",
"indeed-api-key": "161092c2017b5bbab13edb12461a62d5a833871e7cad6d9d475304573de67ac8",
"accept": "application/json",
"indeed-locale": "en-US",
"accept-language": "en-US,en;q=0.9",
"user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_6_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 Indeed App 193.1",
"indeed-app-info": "appv=193.1; appid=com.indeed.jobsearch; osv=16.6.1; os=ios; dtype=phone",
}
job_search_query = """
query GetJobData {{
jobSearch(
{what}
{location}
includeSponsoredResults: NONE
limit: 100
sort: DATE
{cursor}
{filters}
) {{
pageInfo {{
nextCursor
}}
results {{
trackingKey
job {{
key
title
datePublished
dateOnIndeed
description {{
html
}}
location {{
countryName
countryCode
admin1Code
city
postalCode
streetAddress
formatted {{
short
long
}}
}}
compensation {{
baseSalary {{
unitOfWork
range {{
... on Range {{
min
max
}}
}}
}}
currencyCode
}}
attributes {{
key
label
}}
employer {{
relativeCompanyPageUrl
name
dossier {{
employerDetails {{
addresses
industry
employeesLocalizedLabel
revenueLocalizedLabel
briefDescription
ceoName
ceoPhotoUrl
}}
images {{
headerImageUrl
squareLogoUrl
}}
links {{
corporateWebsite
}}
}}
}}
recruit {{
viewJobUrl
detailedSalary
workSchedule
}}
}}
}}
}}
}}
"""
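
The composite attribute filter that `_build_filters` splices into the query above can be sketched as a standalone function. This is an illustrative sketch only: `build_attribute_filter` is a hypothetical name, and quoting each key with `json.dumps` is an assumption made here to keep the output well-formed; the scraper itself joins the keys by hand.

```python
import json


def build_attribute_filter(keys: list) -> str:
    """Render a GraphQL `filters:` argument selecting jobs by attribute keys.

    Returns an empty string when there are no keys, so the result can be
    dropped straight into the query template's filter slot.
    """
    if not keys:
        return ""
    # json.dumps adds the double quotes and escapes any embedded quotes
    keys_str = ", ".join(json.dumps(key) for key in keys)
    return (
        "filters: { composite: { filters: [{ keyword: { "
        f'field: "attributes", keys: [{keys_str}] '
        "} }] } }"
    )


# "CF3CP" (full time) and "DSQF7" (remote) are the keys used in the scraper above.
print(build_attribute_filter(["CF3CP", "DSQF7"]))
```

Building the fragment separately from the query template makes it easy to unit-test the filter text without sending a request.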


@ -1,113 +0,0 @@
from __future__ import annotations
import re
import logging
import requests
import tls_client
import numpy as np
from markdownify import markdownify as md
from requests.adapters import HTTPAdapter, Retry
from ..jobs import JobType
logger = logging.getLogger("JobSpy")
logger.propagate = False
if not logger.handlers:
logger.setLevel(logging.INFO)
console_handler = logging.StreamHandler()
log_format = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"  # avoid shadowing the built-in format()
formatter = logging.Formatter(log_format)
console_handler.setFormatter(formatter)
logger.addHandler(console_handler)
def set_logger_level(verbose: int = 2):
"""
Adjusts the logger's level. This function allows the logging level to be changed at runtime.
Parameters:
- verbose: int {0, 1, 2} (default=2, all logs)
"""
if verbose is None:
return
level_name = {2: "INFO", 1: "WARNING", 0: "ERROR"}.get(verbose, "INFO")
level = getattr(logging, level_name.upper(), None)
if level is not None:
logger.setLevel(level)
else:
raise ValueError(f"Invalid log level: {level_name}")
def markdown_converter(description_html: str):
if description_html is None:
return None
markdown = md(description_html)
return markdown.strip()
def extract_emails_from_text(text: str) -> list[str] | None:
if not text:
return None
email_regex = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
return email_regex.findall(text)
def create_session(
proxy: dict | None = None,
is_tls: bool = True,
has_retry: bool = False,
delay: int = 1,
) -> requests.Session:
"""
Creates a requests session with optional tls, proxy, and retry settings.
:return: A session object
"""
if is_tls:
session = tls_client.Session(random_tls_extension_order=True)
session.proxies = proxy
else:
session = requests.Session()
session.allow_redirects = True
if proxy:
session.proxies.update(proxy)
if has_retry:
retries = Retry(
total=3,
connect=3,
status=3,
status_forcelist=[500, 502, 503, 504, 429],
backoff_factor=delay,
)
adapter = HTTPAdapter(max_retries=retries)
session.mount("http://", adapter)
session.mount("https://", adapter)
return session
def get_enum_from_job_type(job_type_str: str) -> JobType | None:
"""
Given a string, returns the corresponding JobType enum member if a match is found.
"""
res = None
for job_type in JobType:
if job_type_str in job_type.value:
res = job_type
return res
def currency_parser(cur_str):
# Remove any non-numerical characters
# except for ',' '.' or '-' (e.g. EUR)
cur_str = re.sub("[^-0-9.,]", "", cur_str)
# Remove thousands separators (either , or .) from everything but the last three characters
cur_str = re.sub("[.,]", "", cur_str[:-3]) + cur_str[-3:]
if "." in list(cur_str[-3:]):
num = float(cur_str)
elif "," in list(cur_str[-3:]):
num = float(cur_str.replace(",", "."))
else:
num = float(cur_str)
return np.round(num, 2)
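
To make the behavior of `currency_parser` above concrete, here is a minimal re-statement of the same steps with a few worked inputs. It is a sketch under one stated assumption: it uses the built-in `round` instead of `np.round` so that it runs without NumPy. The name `parse_amount` is hypothetical.

```python
import re


def parse_amount(cur_str: str) -> float:
    """Parse a money string that uses either ',' or '.' as the decimal mark."""
    # Keep only digits, separators, and the minus sign
    cur_str = re.sub(r"[^-0-9.,]", "", cur_str)
    # Strip thousands separators from everything except the last three characters,
    # which may hold the decimal mark and up to two decimals
    cur_str = re.sub(r"[.,]", "", cur_str[:-3]) + cur_str[-3:]
    if "," in cur_str[-3:]:
        # European style: a trailing comma is the decimal mark
        cur_str = cur_str.replace(",", ".")
    # Assumption: built-in round stands in for np.round here
    return round(float(cur_str), 2)


print(parse_amount("$1,234.56"))   # US style
print(parse_amount("1.234,56 €"))  # European style
print(parse_amount("50000"))       # no separators at all
```

The last-three-characters rule is what lets one function handle both conventions: a separator in that window is treated as the decimal mark, while any separator before it is treated as grouping.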


@ -1,14 +0,0 @@
from ..jobspy import scrape_jobs
import pandas as pd
def test_all():
result = scrape_jobs(
site_name=["linkedin", "indeed", "zip_recruiter", "glassdoor"],
search_term="software engineer",
results_wanted=5,
)
assert (
isinstance(result, pd.DataFrame) and not result.empty
), "Result should be a non-empty DataFrame"


@ -1,11 +0,0 @@
from ..jobspy import scrape_jobs
import pandas as pd
def test_glassdoor():
result = scrape_jobs(
site_name="glassdoor", search_term="software engineer", country_indeed="USA"
)
assert (
isinstance(result, pd.DataFrame) and not result.empty
), "Result should be a non-empty DataFrame"


@ -1,11 +0,0 @@
from ..jobspy import scrape_jobs
import pandas as pd
def test_indeed():
result = scrape_jobs(
site_name="indeed", search_term="software engineer", country_indeed="usa"
)
assert (
isinstance(result, pd.DataFrame) and not result.empty
), "Result should be a non-empty DataFrame"


@ -1,12 +0,0 @@
from ..jobspy import scrape_jobs
import pandas as pd
def test_linkedin():
result = scrape_jobs(
site_name="linkedin",
search_term="software engineer",
)
assert (
isinstance(result, pd.DataFrame) and not result.empty
), "Result should be a non-empty DataFrame"


@ -1,13 +0,0 @@
from ..jobspy import scrape_jobs
import pandas as pd
def test_ziprecruiter():
result = scrape_jobs(
site_name="zip_recruiter",
search_term="software engineer",
)
assert (
isinstance(result, pd.DataFrame) and not result.empty
), "Result should be a non-empty DataFrame"