Compare commits

...

83 Commits

Author SHA1 Message Date
Cullen Watson
7cb0c518fc docs:readme 2025-02-21 12:53:59 -06:00
Cullen Watson
df70d4bc2e minor 2025-02-21 12:35:31 -06:00
Cullen Watson
3006063875 enh:remove log by default 2025-02-21 12:31:04 -06:00
Abdulrahman Hisham
1be009b8bc Adding Bayt.com Scraper to current codebase (#246) 2025-02-21 12:29:54 -06:00
Cullen Watson
81ed9b3ddf enh:remove log by default 2025-02-21 12:29:28 -06:00
Abdulrahman Al Muaitah
11a9e9a56a Fixed Bayt scraper integration 2025-02-21 20:10:02 +04:00
Abdulrahman Al Muaitah
c6ade14784 Added Bayt Scraper integration 2025-02-21 15:31:29 +04:00
Cullen Watson
13c74a0fed docs:readme 2025-02-09 13:42:18 -06:00
Cullen Watson
333e9e6760 docs:readme 2025-01-17 21:44:49 -06:00
github-actions
04032a0f91 Increment version 2024-12-04 22:55:06 +00:00
Cullen Watson
496896d0b5 enh:fix yml (#225) 2024-12-04 16:54:52 -06:00
Cullen Watson
87ba1ad1bf fix yml 2024-12-04 16:52:15 -06:00
Jason Geffner
4e7ac9a583 Fix Google job search (#223)
The previous regex did not capture all expected matches in the returned content
2024-12-04 16:45:59 -06:00
Cullen Watson
e44d13e1cf enh:auto update version 2024-12-04 16:29:38 -06:00
Cullen Watson
d52e366ef7 docs:readme 2024-11-26 15:51:26 -06:00
Cullen Watson
395ebf0017 docs:readme 2024-11-26 15:49:12 -06:00
Cullen Watson
63fddd9b7f docs:readme 2024-11-26 15:48:22 -06:00
Cullen Watson
58956868ae docs:readme 2024-11-26 15:47:10 -06:00
Cullen Watson
4fce836222 docs:readme 2024-10-28 03:53:59 -05:00
Cullen Watson
5ba25e7a7c docs:readme 2024-10-28 03:42:19 -05:00
Cullen Watson
f7cb3e9206 docs:readme 2024-10-28 03:36:21 -05:00
Cullen Watson
3ad3f121f7 docs:readme 2024-10-28 03:34:52 -05:00
Cullen Watson
ff3c782912 docs:readme 2024-10-25 18:12:08 -05:00
Cullen Watson
338d854b96 fix(google): search (#216) 2024-10-25 14:54:14 -05:00
Cullen Watson
811d4c40b4 chore:version 2024-10-24 15:28:25 -05:00
Cullen Watson
dba92d22c2 chore:version 2024-10-24 15:27:16 -05:00
Cullen Watson
10a3592a0f docs:file 2024-10-24 15:26:49 -05:00
Cullen Watson
b7905cc756 docs:file 2024-10-24 15:24:18 -05:00
Cullen Watson
6867d58829 docs:readme 2024-10-24 15:22:31 -05:00
Cullen Watson
f6248c8386 enh: google jobs (#214) 2024-10-24 15:19:40 -05:00
Cullen Watson
f395597fdd fix(indeed): offset 2024-10-22 19:25:07 -05:00
Cullen Watson
6372e41bd9 chore:version 2024-10-20 00:19:31 -05:00
Olzhas Arystanov
6c869decb8 build(deps): bump markdownify to 0.13.1 (#211) 2024-10-20 00:18:44 -05:00
Cullen Watson
9f4083380d indeed:remove tpe (#210) 2024-10-19 18:01:59 -05:00
Olzhas Arystanov
9207ab56f6 fix: extract tests out of src (#209) 2024-10-19 16:56:38 -05:00
Cullen Watson
757a94853e chore:version 2024-10-08 17:49:06 -05:00
Marcel Gozalbo Baró
6bc191d5c7 FEATURE: Add the "ca_cert" setting for providing a Certification Authority certificate in order to use proxies requiring it. (#204) 2024-10-08 17:46:46 -05:00
Cullen Watson
0cc34287f7 fix:turkey 2024-10-02 01:31:00 -05:00
Anton Pikhteryev
923979093b Add Malta for linkedin country support (#198) 2024-09-19 20:41:22 -05:00
Cullen Watson
286f0e4487 docs:readme 2024-09-18 18:49:41 -05:00
Cullen Watson
f7b29d43a2 fix(indeed):sort relevance not date (#197) 2024-09-18 18:42:25 -05:00
Cullen Watson
6f1490458c fix key error (#186) 2024-08-14 02:54:40 -05:00
Cullen Watson
6bb7d81ba8 change linkedin ep (#185) 2024-08-14 02:39:43 -05:00
Cullen Watson
0e046432d1 fix:variable bug (#181) 2024-08-05 12:47:55 -05:00
Cullen Watson
209e0e65b6 fix:malaysia indeed (#180) 2024-08-03 22:48:53 -05:00
Cullen Watson
8570c0651e fix:key error (#176) 2024-07-21 13:05:18 -05:00
Cullen Watson
8678b0bbe4 enh: test on pr (#174) 2024-07-19 14:25:25 -05:00
Cullen Watson
60d4d911c9 lock file (#173) 2024-07-17 21:21:22 -05:00
Lluís Salord Quetglas
2a0cba8c7e FEAT: Optional convertion to annual and know salary source (#170) 2024-07-17 21:05:33 -05:00
Mason DePalma
de70189fa2 Update pyproject.toml (#172)
Changed Numpy to the most recent version so the package can properly install
2024-07-17 20:54:08 -05:00
Cullen Watson
b55c0eb86d docs:readme 2024-07-16 19:24:38 -05:00
Cullen Watson
88c95c4ad5 enh: estimated salary (#169) 2024-07-16 19:20:34 -05:00
Cullen Watson
d8d33d602f docs: readme 2024-07-15 21:30:11 -05:00
Cullen Watson
6330c14879 minor fix 2024-07-15 21:19:01 -05:00
Ali Bakhshi Ilani
48631ea271 Add company industry and job level to linkedin scraper (#166) 2024-07-15 21:07:39 -05:00
Cullen Watson
edffe18e65 enh: listing source (#168) 2024-07-15 20:30:04 -05:00
Lluís Salord Quetglas
0988230a24 FEAT: Add Glassdoor logo data if available (#167) 2024-07-15 20:25:18 -05:00
Cullen Watson
d000a81eb3 Salary parse (#163) 2024-06-09 17:45:38 -05:00
Cullen Watson
ccb0c17660 enh: ziprecruiter full description (#162) 2024-06-09 16:21:01 -05:00
Cullen Watson
df339610fa docs: readme 2024-05-29 19:32:32 -05:00
Cullen Watson
c501006bd8 docs: readme 2024-05-28 16:04:26 -05:00
Cullen Watson
89a3ee231c enh(li): job function (#160) 2024-05-28 16:01:29 -05:00
Cullen
6439f71433 chore: version 2024-05-28 15:39:24 -05:00
adamagassi
7f6271b2e0 LinkedIn scraper fixes: (#159)
Correct initial page offset calculation
Separate page variable from request counter
Fix job offset starting value
Increment offset by number of jobs returned instead of expected value
2024-05-28 15:38:13 -05:00
Cullen Watson
5cb7ffe5fd enh: proxies (#157)
* enh: proxies

* enh: proxies
2024-05-25 14:04:09 -05:00
Cullen Watson
cd29f79796 docs: readme 2024-05-25 11:46:23 -05:00
Cullen Watson
65d2e5e707 Update pyproject.toml 2024-05-20 11:46:36 -05:00
fasih hussain
08d63a87a2 chore: id added for JobPost schema (#152) 2024-05-20 11:45:52 -05:00
Cullen
1ffdb1756f fix: dup line 2024-04-30 12:11:48 -05:00
Cullen Watson
1185693422 delete empty file 2024-04-30 12:06:20 -05:00
Lluís Salord Quetglas
dcd7144318 FIX: Allow Indeed search term with complex syntax (#139) 2024-04-30 12:05:43 -05:00
Cullen Watson
bf73c061bd enh: linkedin company logo (#141) 2024-04-30 12:03:10 -05:00
Lluís Salord Quetglas
8dd08ed9fd FEAT: Allow LinkedIn scraper to get external job apply url (#140) 2024-04-30 11:36:01 -05:00
Cullen Watson
5d3df732e6 docs: readme 2024-03-12 20:46:25 -05:00
Kellen Mace
86f858e06d Update scrape_jobs() parameters info in readme (#130) 2024-03-12 20:45:13 -05:00
Cullen
1089d1f0a5 docs: readme 2024-03-11 21:30:57 -05:00
Cullen
3e93454738 fix(indeed): readd param 2024-03-11 21:23:20 -05:00
Cullen Watson
0d150d519f docs: readme 2024-03-11 14:52:20 -05:00
Cullen Watson
cc3497f929 docs: readme 2024-03-11 14:45:17 -05:00
Cullen Watson
5986f75346 docs: readme 2024-03-11 14:41:12 -05:00
VitaminB16
4b7bdb9313 feat: Adjust log verbosity via verbose arg (#128) 2024-03-11 14:38:44 -05:00
Cullen Watson
80213f28d2 chore: version 2024-03-11 09:43:12 -05:00
Cullen Watson
ada38532c3 fix: indeed empty location term 2024-03-11 09:42:43 -05:00
31 changed files with 3067 additions and 2144 deletions


@@ -1,33 +1,50 @@
 name: Publish Python 🐍 distributions 📦 to PyPI
-on: push
+on:
+  pull_request:
+    types:
+      - closed
+
+permissions:
+  contents: write
+
 jobs:
   build-n-publish:
     name: Build and publish Python 🐍 distributions 📦 to PyPI
     runs-on: ubuntu-latest
+    if: github.event.pull_request.merged == true && github.event.pull_request.base.ref == 'main'
     steps:
       - uses: actions/checkout@v3
       - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
+      - name: Install dependencies
+        run: pip install toml
+      - name: Increment version
+        run: python increment_version.py
+      - name: Commit version increment
+        run: |
+          git config --global user.name 'github-actions'
+          git config --global user.email 'github-actions@github.com'
+          git add pyproject.toml
+          git commit -m 'Increment version'
+      - name: Push changes
+        run: git push
       - name: Install poetry
-        run: >-
-          python3 -m
-          pip install
-          poetry
-          --user
+        run: pip install poetry --user
       - name: Build distribution 📦
-        run: >-
-          python3 -m
-          poetry
-          build
+        run: poetry build
       - name: Publish distribution 📦 to PyPI
-        if: startsWith(github.ref, 'refs/tags')
         uses: pypa/gh-action-pypi-publish@release/v1
         with:
           password: ${{ secrets.PYPI_API_TOKEN }}

README.md (220 changed lines)

@@ -1,20 +1,12 @@
 <img src="https://github.com/cullenwatson/JobSpy/assets/78247585/ae185b7e-e444-4712-8bb9-fa97f53e896b" width="400">
 
-**JobSpy** is a simple, yet comprehensive, job scraping library.
+**JobSpy** is a job scraping library with the goal of aggregating all the jobs from popular job boards with one tool.
 
-**Not technical?** Try out the web scraping tool on our site at [usejobspy.com](https://usejobspy.com).
-
-*Looking to build a data-focused software product?* **[Book a call](https://bunsly.com/)** *to
-work with us.*
-
 ## Features
 
-- Scrapes job postings from **LinkedIn**, **Indeed**, **Glassdoor**, & **ZipRecruiter** simultaneously
-- Aggregates the job postings in a Pandas DataFrame
-- Proxy support
-
-[Video Guide for JobSpy](https://www.youtube.com/watch?v=RuP1HrAZnxs&pp=ygUgam9icyBzY3JhcGVyIGJvdCBsaW5rZWRpbiBpbmRlZWQ%3D) -
-Updated for release v1.1.3
+- Scrapes job postings from **LinkedIn**, **Indeed**, **Glassdoor**, **Google**, **ZipRecruiter**, & **Bayt** concurrently
+- Aggregates the job postings in a dataframe
+- Proxies support to bypass blocking
 
 ![jobspy](https://github.com/cullenwatson/JobSpy/assets/78247585/ec7ef355-05f6-4fd3-8161-a817e31c5c57)
@@ -33,22 +25,26 @@ import csv
 from jobspy import scrape_jobs
 
 jobs = scrape_jobs(
-    site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"],
+    site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor", "google", "bayt"],
     search_term="software engineer",
-    location="Dallas, TX",
+    google_search_term="software engineer jobs near San Francisco, CA since yesterday",
+    location="San Francisco, CA",
     results_wanted=20,
-    hours_old=72, # (only Linkedin/Indeed is hour specific, others round up to days old)
-    country_indeed='USA'  # only needed for indeed / glassdoor
+    hours_old=72,
+    country_indeed='USA',
+
+    # linkedin_fetch_description=True # gets more info such as description, direct job url (slower)
+    # proxies=["208.195.175.46:65095", "208.195.175.45:65095", "localhost"],
 )
 print(f"Found {len(jobs)} jobs")
 print(jobs.head())
-jobs.to_csv("jobs.csv", quoting=csv.QUOTE_NONNUMERIC, escapechar="\\", index=False) # to_xlsx
+jobs.to_csv("jobs.csv", quoting=csv.QUOTE_NONNUMERIC, escapechar="\\", index=False) # to_excel
 ```
 
 ### Output
 
 ```
-SITE TITLE COMPANY_NAME CITY STATE JOB_TYPE INTERVAL MIN_AMOUNT MAX_AMOUNT JOB_URL DESCRIPTION
+SITE TITLE COMPANY CITY STATE JOB_TYPE INTERVAL MIN_AMOUNT MAX_AMOUNT JOB_URL DESCRIPTION
 indeed Software Engineer AMERICAN SYSTEMS Arlington VA None yearly 200000 150000 https://www.indeed.com/viewjob?jk=5e409e577046... THIS POSITION COMES WITH A 10K SIGNING BONUS!...
 indeed Senior Software Engineer TherapyNotes.com Philadelphia PA fulltime yearly 135000 110000 https://www.indeed.com/viewjob?jk=da39574a40cb... About Us TherapyNotes is the national leader i...
 linkedin Software Engineer - Early Career Lockheed Martin Sunnyvale CA fulltime yearly None None https://www.linkedin.com/jobs/view/3693012711 Description:By bringing together people that u...
@@ -60,66 +56,84 @@ zip_recruiter Software Developer TEKsystems Phoenix
 
 ### Parameters for `scrape_jobs()`
 
 ```plaintext
-Required
-├── site_type (List[enum]): linkedin, zip_recruiter, indeed, glassdoor
-└── search_term (str)
-
 Optional
-├── location (str)
-├── distance (int): in miles, default 50
-├── job_type (enum): fulltime, parttime, internship, contract
-├── proxy (str): in format 'http://user:pass@host:port'
-├── is_remote (bool)
-├── linkedin_fetch_description (bool): fetches full description for LinkedIn (slower)
-├── results_wanted (int): number of job results to retrieve for each site specified in 'site_type'
-├── easy_apply (bool): filters for jobs that are hosted on the job board site (not supported on Indeed)
-├── linkedin_company_ids (list[int): searches for linkedin jobs with specific company ids
-├── description_format (enum): markdown, html (format type of the job descriptions)
-├── country_indeed (enum): filters the country on Indeed (see below for correct spelling)
-├── offset (num): starts the search from an offset (e.g. 25 will start the search from the 25th result)
-├── hours_old (int): filters jobs by the number of hours since the job was posted (ZipRecruiter and Glassdoor round up to next day. If you use this on Indeed, it will not filter by job_type or is_remote)
+├── site_name (list|str):
+|    linkedin, zip_recruiter, indeed, glassdoor, google, bayt
+|    (default is all)
+│
+├── search_term (str)
+│
+├── google_search_term (str)
+|    search term for google jobs. This is the only param for filtering google jobs.
+│
+├── location (str)
+│
+├── distance (int):
+|    in miles, default 50
+│
+├── job_type (str):
+|    fulltime, parttime, internship, contract
+│
+├── proxies (list):
+|    in format ['user:pass@host:port', 'localhost']
+|    each job board scraper will round robin through the proxies
+│
+├── is_remote (bool)
+│
+├── results_wanted (int):
+|    number of job results to retrieve for each site specified in 'site_name'
+│
+├── easy_apply (bool):
+|    filters for jobs that are hosted on the job board site (LinkedIn easy apply filter no longer works)
+│
+├── description_format (str):
+|    markdown, html (Format type of the job descriptions. Default is markdown.)
+│
+├── offset (int):
+|    starts the search from an offset (e.g. 25 will start the search from the 25th result)
+│
+├── hours_old (int):
+|    filters jobs by the number of hours since the job was posted
+|    (ZipRecruiter and Glassdoor round up to next day.)
+│
+├── verbose (int) {0, 1, 2}:
+|    Controls the verbosity of the runtime printouts
+|    (0 prints only errors, 1 is errors+warnings, 2 is all logs. Default is 2.)
+│
+├── linkedin_fetch_description (bool):
+|    fetches full description and direct job url for LinkedIn (Increases requests by O(n))
+│
+├── linkedin_company_ids (list[int]):
+|    searches for linkedin jobs with specific company ids
+│
+├── country_indeed (str):
+|    filters the country on Indeed & Glassdoor (see below for correct spelling)
+│
+├── enforce_annual_salary (bool):
+|    converts wages to annual salary
+│
+├── ca_cert (str)
+|    path to CA Certificate file for proxies
 ```
 
-### JobPost Schema
-
-```plaintext
-JobPost
-├── title (str)
-├── company (str)
-├── company_url (str)
-├── job_url (str)
-├── location (object)
-│   ├── country (str)
-│   ├── city (str)
-│   ├── state (str)
-├── description (str)
-├── job_type (str): fulltime, parttime, internship, contract
-├── compensation (object)
-│   ├── interval (str): yearly, monthly, weekly, daily, hourly
-│   ├── min_amount (int)
-│   ├── max_amount (int)
-│   └── currency (enum)
-└── date_posted (date)
-└── emails (str)
-└── is_remote (bool)
-
-Indeed specific
-├── company_country (str)
-└── company_addresses (str)
-└── company_industry (str)
-└── company_employees_label (str)
-└── company_revenue_label (str)
-└── company_description (str)
-└── ceo_name (str)
-└── ceo_photo_url (str)
-└── logo_photo_url (str)
-└── banner_photo_url (str)
+```
+├── Indeed limitations:
+|    Only one from this list can be used in a search:
+|    - hours_old
+|    - job_type & is_remote
+|    - easy_apply
+│
+└── LinkedIn limitations:
+|    Only one from this list can be used in a search:
+|    - hours_old
+|    - easy_apply
 ```
 
 ## Supported Countries for Job Searching
 
 ### **LinkedIn**
 
-LinkedIn searches globally & uses only the `location` parameter. You can only fetch 1000 jobs max from the LinkedIn endpoint we are using
+LinkedIn searches globally & uses only the `location` parameter.
 
 ### **ZipRecruiter**
@@ -151,26 +165,84 @@ You can specify the following countries when searching on Indeed (use the exact
 | United Arab Emirates | UK* | USA* | Uruguay |
 | Venezuela | Vietnam* | | |
 
+### **Bayt**
+
+Bayt only uses the search_term parameter currently and searches internationally
+
 ## Notes
 * Indeed is the best scraper currently with no rate limiting.
-* Glassdoor/Ziprecruiter can only fetch 900/1000 jobs from the endpoints we are using on a given search.
-* LinkedIn is the most restrictive and usually rate limits around the 10th page.
+* All the job board endpoints are capped at around 1000 jobs on a given search.
+* LinkedIn is the most restrictive and usually rate limits around the 10th page with one ip. Proxies are a must basically.
 
 ## Frequently Asked Questions
 
 ---
 
-**Q: Encountering issues with your queries?**
-**A:** Try reducing the number of `results_wanted` and/or broadening the filters. If problems
-persist, [submit an issue](https://github.com/Bunsly/JobSpy/issues).
+**Q: Why is Indeed giving unrelated roles?**
+**A:** Indeed searches the description too.
+
+- use - to remove words
+- "" for exact match
+
+Example of a good Indeed query
+
+```py
+search_term='"engineering intern" software summer (java OR python OR c++) 2025 -tax -marketing'
+```
+
+This searches the description/title and must include software, summer, 2025, one of the languages, engineering intern exactly, no tax, no marketing.
+
+---
+
+**Q: No results when using "google"?**
+**A:** You have to use super specific syntax. Search for google jobs on your browser and then whatever pops up in the google jobs search box after applying some filters is what you need to copy & paste into the google_search_term.
 
 ---
 
 **Q: Received a response code 429?**
 **A:** This indicates that you have been blocked by the job board site for sending too many requests. All of the job board sites are aggressive with blocking. We recommend:
 
-- Waiting some time between scrapes (site-dependent).
-- Trying a VPN or proxy to change your IP address.
+- Wait some time between scrapes (site-dependent).
+- Try using the proxies param to change your IP address.
 
 ---
 
+### JobPost Schema
+
+```plaintext
+JobPost
+├── title
+├── company
+├── company_url
+├── job_url
+├── location
+│   ├── country
+│   ├── city
+│   ├── state
+├── description
+├── job_type: fulltime, parttime, internship, contract
+├── job_function
+│   ├── interval: yearly, monthly, weekly, daily, hourly
+│   ├── min_amount
+│   ├── max_amount
+│   ├── currency
+│   └── salary_source: direct_data, description (parsed from posting)
+├── date_posted
+├── emails
+└── is_remote
+
+Linkedin specific
+└── job_level
+
+Linkedin & Indeed specific
+└── company_industry
+
+Indeed specific
+├── company_country
+├── company_addresses
+├── company_employees_label
+├── company_revenue_label
+├── company_description
+└── company_logo
+```
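
The parameter tree above maps one-to-one onto `scrape_jobs()` keyword arguments. A minimal sketch wiring the newly documented options together (built from the README above; the commented-out proxy entries and cert path are placeholders):

```python
from jobspy import scrape_jobs

jobs = scrape_jobs(
    site_name=["indeed", "glassdoor", "google", "bayt"],
    search_term="software engineer",
    google_search_term="software engineer jobs near San Francisco, CA since yesterday",
    location="San Francisco, CA",
    results_wanted=20,
    hours_old=72,                # ZipRecruiter/Glassdoor round up to the next day
    country_indeed="USA",        # applies to Indeed & Glassdoor
    enforce_annual_salary=True,  # hourly/weekly/monthly wages converted to yearly
    verbose=1,                   # errors + warnings only
    # proxies=["user:pass@host:port", "localhost"],  # round-robined per scraper
    # ca_cert="/path/to/ca-bundle.pem",              # placeholder path
)
# salary_source separates board-provided pay data from pay parsed out of descriptions
print(jobs[["site", "title", "salary_source", "min_amount", "max_amount"]].head())
```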


@@ -1,30 +0,0 @@
from jobspy import scrape_jobs
import pandas as pd
jobs: pd.DataFrame = scrape_jobs(
site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor"],
search_term="software engineer",
location="Dallas, TX",
results_wanted=25, # be wary the higher it is, the more likey you'll get blocked (rotating proxy can help tho)
country_indeed="USA",
# proxy="http://jobspy:5a4vpWtj8EeJ2hoYzk@ca.smartproxy.com:20001",
)
# formatting for pandas
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", None)
pd.set_option("display.max_colwidth", 50) # set to 0 to see full job url / desc
# 1: output to console
print(jobs)
# 2: output to .csv
jobs.to_csv("./jobs.csv", index=False)
print("outputted to jobs.csv")
# 3: output to .xlsx
# jobs.to_xlsx('jobs.xlsx', index=False)
# 4: display in Jupyter Notebook (1. pip install jupyter 2. jupyter notebook)
# display(jobs)


@@ -1,167 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "00a94b47-f47b-420f-ba7e-714ef219c006",
"metadata": {},
"outputs": [],
"source": [
"from jobspy import scrape_jobs\n",
"import pandas as pd\n",
"from IPython.display import display, HTML"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9f773e6c-d9fc-42cc-b0ef-63b739e78435",
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', None)\n",
"pd.set_option('display.width', None)\n",
"pd.set_option('display.max_colwidth', 50)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1253c1f8-9437-492e-9dd3-e7fe51099420",
"metadata": {},
"outputs": [],
"source": [
"# example 1 (no hyperlinks, USA)\n",
"jobs = scrape_jobs(\n",
" site_name=[\"linkedin\"],\n",
" location='san francisco',\n",
" search_term=\"engineer\",\n",
" results_wanted=5,\n",
"\n",
" # use if you want to use a proxy\n",
" # proxy=\"socks5://jobspy:5a4vpWtj4EeJ2hoYzk@us.smartproxy.com:10001\",\n",
" proxy=\"http://jobspy:5a4vpWtj4EeJ2hoYzk@us.smartproxy.com:10001\",\n",
" #proxy=\"https://jobspy:5a4vpWtj4EeJ2hoYzk@us.smartproxy.com:10001\",\n",
")\n",
"display(jobs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a581b2d-f7da-4fac-868d-9efe143ee20a",
"metadata": {},
"outputs": [],
"source": [
"# example 2 - remote USA & hyperlinks\n",
"jobs = scrape_jobs(\n",
" site_name=[\"linkedin\", \"zip_recruiter\", \"indeed\"],\n",
" # location='san francisco',\n",
" search_term=\"software engineer\",\n",
" country_indeed=\"USA\",\n",
" hyperlinks=True,\n",
" is_remote=True,\n",
" results_wanted=5, \n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe8289bc-5b64-4202-9a64-7c117c83fd9a",
"metadata": {},
"outputs": [],
"source": [
"# use if hyperlinks=True\n",
"html = jobs.to_html(escape=False)\n",
"# change max-width: 200px to show more or less of the content\n",
"truncate_width = f'<style>.dataframe td {{ max-width: 200px; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; }}</style>{html}'\n",
"display(HTML(truncate_width))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "951c2fe1-52ff-407d-8bb1-068049b36777",
"metadata": {},
"outputs": [],
"source": [
"# example 3 - with hyperlinks, international - linkedin (no zip_recruiter)\n",
"jobs = scrape_jobs(\n",
" site_name=[\"linkedin\"],\n",
" location='berlin',\n",
" search_term=\"engineer\",\n",
" hyperlinks=True,\n",
" results_wanted=5,\n",
" easy_apply=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e37a521-caef-441c-8fc2-2eb5b2e7da62",
"metadata": {},
"outputs": [],
"source": [
"# use if hyperlinks=True\n",
"html = jobs.to_html(escape=False)\n",
"# change max-width: 200px to show more or less of the content\n",
"truncate_width = f'<style>.dataframe td {{ max-width: 200px; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; }}</style>{html}'\n",
"display(HTML(truncate_width))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0650e608-0b58-4bf5-ae86-68348035b16a",
"metadata": {},
"outputs": [],
"source": [
"# example 4 - international indeed (no zip_recruiter)\n",
"jobs = scrape_jobs(\n",
" site_name=[\"indeed\"],\n",
" search_term=\"engineer\",\n",
" country_indeed = \"China\",\n",
" hyperlinks=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40913ac8-3f8a-4d7e-ac47-afb88316432b",
"metadata": {},
"outputs": [],
"source": [
"# use if hyperlinks=True\n",
"html = jobs.to_html(escape=False)\n",
"# change max-width: 200px to show more or less of the content\n",
"truncate_width = f'<style>.dataframe td {{ max-width: 200px; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; }}</style>{html}'\n",
"display(HTML(truncate_width))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -1,77 +0,0 @@
from jobspy import scrape_jobs
import pandas as pd
import os
import time
# creates csv a new filename if the jobs.csv already exists.
csv_filename = "jobs.csv"
counter = 1
while os.path.exists(csv_filename):
csv_filename = f"jobs_{counter}.csv"
counter += 1
# results wanted and offset
results_wanted = 1000
offset = 0
all_jobs = []
# max retries
max_retries = 3
# nuumber of results at each iteration
results_in_each_iteration = 30
while len(all_jobs) < results_wanted:
retry_count = 0
while retry_count < max_retries:
print("Doing from", offset, "to", offset + results_in_each_iteration, "jobs")
try:
jobs = scrape_jobs(
site_name=["indeed"],
search_term="software engineer",
# New York, NY
# Dallas, TX
# Los Angeles, CA
location="Los Angeles, CA",
results_wanted=min(results_in_each_iteration, results_wanted - len(all_jobs)),
country_indeed="USA",
offset=offset,
# proxy="http://jobspy:5a4vpWtj8EeJ2hoYzk@ca.smartproxy.com:20001",
)
# Add the scraped jobs to the list
all_jobs.extend(jobs.to_dict('records'))
# Increment the offset for the next page of results
offset += results_in_each_iteration
# Add a delay to avoid rate limiting (you can adjust the delay time as needed)
print(f"Scraped {len(all_jobs)} jobs")
print("Sleeping secs", 100 * (retry_count + 1))
time.sleep(100 * (retry_count + 1)) # Sleep for 2 seconds between requests
break # Break out of the retry loop if successful
except Exception as e:
print(f"Error: {e}")
retry_count += 1
print("Sleeping secs before retry", 100 * (retry_count + 1))
time.sleep(100 * (retry_count + 1))
if retry_count >= max_retries:
print("Max retries reached. Exiting.")
break
# DataFrame from the collected job data
jobs_df = pd.DataFrame(all_jobs)
# Formatting
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", None)
pd.set_option("display.max_colwidth", 50)
print(jobs_df)
jobs_df.to_csv(csv_filename, index=False)
print(f"Outputted to {csv_filename}")

increment_version.py (new file, 21 lines)

@@ -0,0 +1,21 @@
import toml
def increment_version(version):
major, minor, patch = map(int, version.split('.'))
patch += 1
return f"{major}.{minor}.{patch}"
# Load pyproject.toml
with open('pyproject.toml', 'r') as file:
pyproject = toml.load(file)
# Increment the version
current_version = pyproject['tool']['poetry']['version']
new_version = increment_version(current_version)
pyproject['tool']['poetry']['version'] = new_version
# Save the updated pyproject.toml
with open('pyproject.toml', 'w') as file:
toml.dump(pyproject, file)
print(f"Version updated from {current_version} to {new_version}")

poetry.lock (generated, 2569 changed lines; file diff suppressed because it is too large)


@@ -1,35 +1,35 @@
+[build-system]
+requires = [ "poetry-core",]
+build-backend = "poetry.core.masonry.api"
+
 [tool.poetry]
 name = "python-jobspy"
-version = "1.1.49"
+version = "1.1.76"
 description = "Job scraper for LinkedIn, Indeed, Glassdoor & ZipRecruiter"
-authors = ["Zachary Hampton <zachary@bunsly.com>", "Cullen Watson <cullen@bunsly.com>"]
+authors = [ "Zachary Hampton <zachary@bunsly.com>", "Cullen Watson <cullen@bunsly.com>",]
 homepage = "https://github.com/Bunsly/JobSpy"
 readme = "README.md"
+keywords = [ "jobs-scraper", "linkedin", "indeed", "glassdoor", "ziprecruiter",]
 
-packages = [
-    { include = "jobspy", from = "src" }
-]
+[[tool.poetry.packages]]
+include = "jobspy"
+from = "src"
+
+[tool.black]
+line-length = 88
 
 [tool.poetry.dependencies]
 python = "^3.10"
 requests = "^2.31.0"
 beautifulsoup4 = "^4.12.2"
 pandas = "^2.1.0"
-NUMPY = "1.24.2"
+NUMPY = "1.26.3"
 pydantic = "^2.3.0"
 tls-client = "^1.0.1"
-markdownify = "^0.11.6"
+markdownify = "^0.13.1"
+regex = "^2024.4.28"
 
 [tool.poetry.group.dev.dependencies]
 pytest = "^7.4.1"
 jupyter = "^1.0.0"
-black = "^24.2.0"
-pre-commit = "^3.6.2"
-
-[build-system]
-requires = ["poetry-core"]
-build-backend = "poetry.core.masonry.api"
-
-[tool.black]
-line-length = 88
+black = "*"
+pre-commit = "*"
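
The package publishes to PyPI as python-jobspy while the import name stays jobspy. A quick way to confirm which release you picked up after this compare range (a sketch; the printed number depends on your install):

```python
# Assumes `pip install python-jobspy` (the PyPI name in the pyproject.toml diff above);
# the import name remains `jobspy`.
from importlib.metadata import version

print(version("python-jobspy"))  # e.g. "1.1.76" at the head of this compare range
```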


@@ -5,23 +5,27 @@ from typing import Tuple
 from concurrent.futures import ThreadPoolExecutor, as_completed
 
 from .jobs import JobType, Location
-from .scrapers.utils import logger
+from .scrapers.utils import set_logger_level, extract_salary, create_logger
 from .scrapers.indeed import IndeedScraper
 from .scrapers.ziprecruiter import ZipRecruiterScraper
 from .scrapers.glassdoor import GlassdoorScraper
+from .scrapers.google import GoogleJobsScraper
 from .scrapers.linkedin import LinkedInScraper
-from .scrapers import ScraperInput, Site, JobResponse, Country
+from .scrapers.bayt import BaytScraper
+from .scrapers import SalarySource, ScraperInput, Site, JobResponse, Country
 from .scrapers.exceptions import (
     LinkedInException,
     IndeedException,
     ZipRecruiterException,
     GlassdoorException,
+    GoogleJobsException,
 )
 
 
 def scrape_jobs(
     site_name: str | list[str] | Site | list[Site] | None = None,
     search_term: str | None = None,
+    google_search_term: str | None = None,
     location: str | None = None,
     distance: int | None = 50,
     is_remote: bool = False,
@@ -30,24 +34,30 @@ def scrape_jobs(
     results_wanted: int = 15,
     country_indeed: str = "usa",
     hyperlinks: bool = False,
-    proxy: str | None = None,
+    proxies: list[str] | str | None = None,
+    ca_cert: str | None = None,
     description_format: str = "markdown",
     linkedin_fetch_description: bool | None = False,
     linkedin_company_ids: list[int] | None = None,
     offset: int | None = 0,
     hours_old: int = None,
+    enforce_annual_salary: bool = False,
+    verbose: int = 0,
     **kwargs,
 ) -> pd.DataFrame:
     """
     Simultaneously scrapes job data from multiple job sites.
-    :return: results_wanted: pandas dataframe containing job data
+    :return: pandas dataframe containing job data
     """
     SCRAPER_MAPPING = {
         Site.LINKEDIN: LinkedInScraper,
         Site.INDEED: IndeedScraper,
         Site.ZIP_RECRUITER: ZipRecruiterScraper,
         Site.GLASSDOOR: GlassdoorScraper,
+        Site.GOOGLE: GoogleJobsScraper,
+        Site.BAYT: BaytScraper,
     }
+    set_logger_level(verbose)
 
     def map_str_to_site(site_name: str) -> Site:
         return Site[site_name.upper()]
@@ -79,6 +89,7 @@ def scrape_jobs(
         site_type=get_site_type(),
         country=country_enum,
         search_term=search_term,
+        google_search_term=google_search_term,
         location=location,
         distance=distance,
         is_remote=is_remote,
@@ -94,11 +105,11 @@ def scrape_jobs(
 
     def scrape_site(site: Site) -> Tuple[str, JobResponse]:
         scraper_class = SCRAPER_MAPPING[site]
-        scraper = scraper_class(proxy=proxy)
+        scraper = scraper_class(proxies=proxies, ca_cert=ca_cert)
         scraped_data: JobResponse = scraper.scrape(scraper_input)
         cap_name = site.value.capitalize()
         site_name = "ZipRecruiter" if cap_name == "Zip_recruiter" else cap_name
-        logger.info(f"{site_name} finished scraping")
+        create_logger(site_name).info(f"finished scraping")
         return site.value, scraped_data
 
     site_to_jobs_dict = {}
@@ -116,6 +127,21 @@ def scrape_jobs(
         site_value, scraped_data = future.result()
         site_to_jobs_dict[site_value] = scraped_data
 
+    def convert_to_annual(job_data: dict):
+        if job_data["interval"] == "hourly":
+            job_data["min_amount"] *= 2080
+            job_data["max_amount"] *= 2080
+        if job_data["interval"] == "monthly":
+            job_data["min_amount"] *= 12
+            job_data["max_amount"] *= 12
+        if job_data["interval"] == "weekly":
+            job_data["min_amount"] *= 52
+            job_data["max_amount"] *= 52
+        if job_data["interval"] == "daily":
+            job_data["min_amount"] *= 260
+            job_data["max_amount"] *= 260
+        job_data["interval"] = "yearly"
+
     jobs_dfs: list[pd.DataFrame] = []
 
     for site, job_response in site_to_jobs_dict.items():
@@ -148,12 +174,33 @@ def scrape_jobs(
                 job_data["min_amount"] = compensation_obj.get("min_amount")
                 job_data["max_amount"] = compensation_obj.get("max_amount")
                 job_data["currency"] = compensation_obj.get("currency", "USD")
-            else:
-                job_data["interval"] = None
-                job_data["min_amount"] = None
-                job_data["max_amount"] = None
-                job_data["currency"] = None
+                job_data["salary_source"] = SalarySource.DIRECT_DATA.value
+                if enforce_annual_salary and (
+                    job_data["interval"]
+                    and job_data["interval"] != "yearly"
+                    and job_data["min_amount"]
+                    and job_data["max_amount"]
+                ):
+                    convert_to_annual(job_data)
+            else:
+                if country_enum == Country.USA:
+                    (
+                        job_data["interval"],
+                        job_data["min_amount"],
+                        job_data["max_amount"],
+                        job_data["currency"],
+                    ) = extract_salary(
+                        job_data["description"],
+                        enforce_annual_salary=enforce_annual_salary,
+                    )
+                    job_data["salary_source"] = SalarySource.DESCRIPTION.value
+            job_data["salary_source"] = (
+                job_data["salary_source"]
+                if "min_amount" in job_data and job_data["min_amount"]
+                else None
+            )
 
             job_df = pd.DataFrame([job_data])
             jobs_dfs.append(job_df)
@@ -166,32 +213,34 @@ def scrape_jobs(
         # Desired column order
         desired_order = [
+            "id",
             "site",
             "job_url_hyper" if hyperlinks else "job_url",
             "job_url_direct",
             "title",
             "company",
             "location",
-            "job_type",
             "date_posted",
+            "job_type",
+            "salary_source",
             "interval",
             "min_amount",
             "max_amount",
             "currency",
             "is_remote",
+            "job_level",
+            "job_function",
+            "listing_type",
             "emails",
             "description",
+            "company_industry",
             "company_url",
+            "company_logo",
             "company_url_direct",
             "company_addresses",
-            "company_industry",
             "company_num_employees",
             "company_revenue",
             "company_description",
-            "logo_photo_url",
-            "banner_photo_url",
-            "ceo_name",
-            "ceo_photo_url",
         ]
 
         # Step 3: Ensure all desired columns are present, adding missing ones as empty
@@ -203,6 +252,8 @@ def scrape_jobs(
         jobs_df = jobs_df[desired_order]
 
         # Step 4: Sort the DataFrame as required
-        return jobs_df.sort_values(by=["site", "date_posted"], ascending=[True, False])
+        return jobs_df.sort_values(
+            by=["site", "date_posted"], ascending=[True, False]
+        ).reset_index(drop=True)
     else:
         return pd.DataFrame()
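
convert_to_annual bakes in the usual US full-time factors: 2080 hours, 52 weeks, 260 weekdays, or 12 months per year. A standalone restatement with worked numbers (to_annual is illustrative, not part of the codebase):

```python
# Annualization factors used by convert_to_annual in the diff above.
FACTORS = {"hourly": 2080, "weekly": 52, "daily": 260, "monthly": 12}

def to_annual(interval: str, min_amount: float, max_amount: float):
    factor = FACTORS.get(interval)
    if factor is None:  # already yearly (or unknown interval): leave untouched
        return interval, min_amount, max_amount
    return "yearly", min_amount * factor, max_amount * factor

print(to_annual("hourly", 50, 60))        # ('yearly', 104000, 124800)
print(to_annual("monthly", 9000, 11000))  # ('yearly', 108000, 132000)
```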


@@ -92,7 +92,8 @@ class Country(Enum):
     JAPAN = ("japan", "jp")
     KUWAIT = ("kuwait", "kw")
     LUXEMBOURG = ("luxembourg", "lu")
-    MALAYSIA = ("malaysia", "malaysia")
+    MALAYSIA = ("malaysia", "malaysia:my", "com")
+    MALTA = ("malta", "malta:mt", "mt")
     MEXICO = ("mexico", "mx", "com.mx")
     MOROCCO = ("morocco", "ma")
     NETHERLANDS = ("netherlands", "nl", "nl")
@@ -117,7 +118,7 @@ class Country(Enum):
     SWITZERLAND = ("switzerland", "ch", "de:ch")
     TAIWAN = ("taiwan", "tw")
     THAILAND = ("thailand", "th")
-    TURKEY = ("turkey", "tr")
+    TURKEY = ("türkiye,turkey", "tr")
     UKRAINE = ("ukraine", "ua")
     UNITEDARABEMIRATES = ("united arab emirates", "ae")
     UK = ("uk,united kingdom", "uk:gb", "co.uk")
@@ -226,6 +227,7 @@ class DescriptionFormat(Enum):
 class JobPost(BaseModel):
+    id: str | None = None
     title: str
     company_name: str | None
     job_url: str
@@ -241,18 +243,25 @@ class JobPost(BaseModel):
     date_posted: date | None = None
     emails: list[str] | None = None
     is_remote: bool | None = None
+    listing_type: str | None = None
 
+    # linkedin specific
+    job_level: str | None = None
 
+    # linkedin and indeed specific
+    company_industry: str | None = None
 
     # indeed specific
     company_addresses: str | None = None
-    company_industry: str | None = None
     company_num_employees: str | None = None
     company_revenue: str | None = None
     company_description: str | None = None
-    ceo_name: str | None = None
-    ceo_photo_url: str | None = None
-    logo_photo_url: str | None = None
+    company_logo: str | None = None
     banner_photo_url: str | None = None
 
+    # linkedin only atm
+    job_function: str | None = None
+
 
 class JobResponse(BaseModel):
     jobs: list[JobPost] = []
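
JobPost gains id, listing_type, job_level, company_industry, job_function, and company_logo (which replaces the ceo_*/logo_photo_url fields). A minimal construction sketch; the Location keyword arguments are assumed from how the Bayt scraper below builds one, and every value here is made up:

```python
from jobspy.jobs import JobPost, Location

post = JobPost(
    id="gd-123",                  # new: per-board id, e.g. a "gd-" prefix for Glassdoor
    title="Software Engineer",
    company_name="Acme Corp",     # made-up example values throughout
    job_url="https://example.com/job/123",
    location=Location(city="Austin", state="TX"),
    job_level="mid-senior level",   # new: linkedin specific
    company_industry="Software",    # new: linkedin and indeed specific
    company_logo="https://example.com/logo.png",  # new: replaces logo_photo_url
)
print(post.id, post.job_level, post.company_logo)
```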


@@ -1,5 +1,7 @@
 from __future__ import annotations
 
+from abc import ABC, abstractmethod
+
 from ..jobs import (
     Enum,
     BaseModel,
@@ -15,11 +17,19 @@ class Site(Enum):
     INDEED = "indeed"
     ZIP_RECRUITER = "zip_recruiter"
     GLASSDOOR = "glassdoor"
+    GOOGLE = "google"
+    BAYT = "bayt"
+
+
+class SalarySource(Enum):
+    DIRECT_DATA = "direct_data"
+    DESCRIPTION = "description"
 
 
 class ScraperInput(BaseModel):
     site_type: list[Site]
 
     search_term: str | None = None
+    google_search_term: str | None = None
 
     location: str | None = None
     country: Country | None = Country.USA
@@ -36,9 +46,13 @@ class ScraperInput(BaseModel):
     hours_old: int | None = None
 
 
-class Scraper:
-    def __init__(self, site: Site, proxy: list[str] | None = None):
+class Scraper(ABC):
+    def __init__(
+        self, site: Site, proxies: list[str] | None = None, ca_cert: str | None = None
+    ):
         self.site = site
-        self.proxy = (lambda p: {"http": p, "https": p} if p else None)(proxy)
+        self.proxies = proxies
+        self.ca_cert = ca_cert
 
+    @abstractmethod
     def scrape(self, scraper_input: ScraperInput) -> JobResponse: ...
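
Because Scraper is now an ABC with scrape marked @abstractmethod, a board scraper that forgets to implement scrape() fails at instantiation rather than at call time. A minimal sketch of the contract (DemoScraper is hypothetical, not part of the codebase):

```python
from jobspy.scrapers import JobResponse, Scraper, ScraperInput, Site

class DemoScraper(Scraper):
    """Hypothetical subclass illustrating the new ABC contract."""

    def __init__(self, proxies=None, ca_cert=None):
        super().__init__(Site.INDEED, proxies=proxies, ca_cert=ca_cert)

    def scrape(self, scraper_input: ScraperInput) -> JobResponse:
        # A real scraper would fetch and parse here; an empty response keeps it minimal.
        return JobResponse(jobs=[])

DemoScraper().scrape(ScraperInput(site_type=[Site.INDEED]))  # fine
# A subclass without scrape() would raise TypeError when instantiated.
```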


@@ -0,0 +1,145 @@
"""
jobspy.scrapers.bayt
~~~~~~~~~~~~~~~~~~~
This module contains routines to scrape Bayt.
"""
from __future__ import annotations
import random
import time
from bs4 import BeautifulSoup
from .. import Scraper, ScraperInput, Site
from ..utils import create_logger, create_session
from ...jobs import JobPost, JobResponse, Location, Country
log = create_logger("Bayt")
class BaytScraper(Scraper):
base_url = "https://www.bayt.com"
delay = 2
band_delay = 3
def __init__(
self, proxies: list[str] | str | None = None, ca_cert: str | None = None
):
super().__init__(Site.BAYT, proxies=proxies, ca_cert=ca_cert)
self.scraper_input = None
self.session = None
self.country = "worldwide"
def scrape(self, scraper_input: ScraperInput) -> JobResponse:
self.scraper_input = scraper_input
self.session = create_session(
proxies=self.proxies, ca_cert=self.ca_cert, is_tls=False, has_retry=True
)
job_list: list[JobPost] = []
page = 1
results_wanted = (
scraper_input.results_wanted if scraper_input.results_wanted else 10
)
while len(job_list) < results_wanted:
log.info(f"Fetching Bayt jobs page {page}")
job_elements = self._fetch_jobs(self.scraper_input.search_term, page)
if not job_elements:
break
if job_elements:
log.debug(
"First job element snippet:\n" + job_elements[0].prettify()[:500]
)
initial_count = len(job_list)
for job in job_elements:
try:
job_post = self._extract_job_info(job)
if job_post:
job_list.append(job_post)
if len(job_list) >= results_wanted:
break
else:
log.debug(
"Extraction returned None. Job snippet:\n"
+ job.prettify()[:500]
)
except Exception as e:
log.error(f"Bayt: Error extracting job info: {str(e)}")
continue
if len(job_list) == initial_count:
log.info(f"No new jobs found on page {page}. Ending pagination.")
break
page += 1
time.sleep(random.uniform(self.delay, self.delay + self.band_delay))
job_list = job_list[: scraper_input.results_wanted]
return JobResponse(jobs=job_list)
def _fetch_jobs(self, query: str, page: int) -> list | None:
"""
Grabs the job results for the given query and page number.
"""
try:
url = f"{self.base_url}/en/international/jobs/{query}-jobs/?page={page}"
response = self.session.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
job_listings = soup.find_all("li", attrs={"data-js-job": ""})
log.debug(f"Found {len(job_listings)} job listing elements")
return job_listings
except Exception as e:
log.error(f"Bayt: Error fetching jobs - {str(e)}")
return None
def _extract_job_info(self, job: BeautifulSoup) -> JobPost | None:
"""
Extracts the job information from a single job listing.
"""
# Find the h2 element holding the title and link (no class filtering)
job_general_information = job.find("h2")
if not job_general_information:
return
job_title = job_general_information.get_text(strip=True)
job_url = self._extract_job_url(job_general_information)
if not job_url:
return
# Extract company name using the original approach:
company_tag = job.find("div", class_="t-nowrap p10l")
company_name = (
company_tag.find("span").get_text(strip=True)
if company_tag and company_tag.find("span")
else None
)
# Extract location using the original approach:
location_tag = job.find("div", class_="t-mute t-small")
location = location_tag.get_text(strip=True) if location_tag else None
job_id = f"bayt-{abs(hash(job_url))}"
location_obj = Location(
city=location,
country=Country.from_string(self.country),
)
return JobPost(
id=job_id,
title=job_title,
company_name=company_name,
location=location_obj,
job_url=job_url,
)
def _extract_job_url(self, job_general_information: BeautifulSoup) -> str | None:
"""
Pulls the job URL from the 'a' within the h2 element.
"""
a_tag = job_general_information.find("a")
if a_tag and a_tag.has_attr("href"):
return self.base_url + a_tag["href"].strip()
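
The extraction above hangs off three selectors: the first h2 for the title and link, a div with class t-nowrap p10l for the company span, and a div with class t-mute t-small for the location. A sketch against a made-up fragment shaped the way the scraper expects:

```python
from bs4 import BeautifulSoup

# Made-up listing markup matching the selectors bayt.py targets.
html = """
<li data-js-job="">
  <h2><a href="/en/job/example-123/">Senior Software Engineer</a></h2>
  <div class="t-nowrap p10l"><span>Acme Corp</span></div>
  <div class="t-mute t-small">Dubai</div>
</li>
"""
job = BeautifulSoup(html, "html.parser").find("li", attrs={"data-js-job": ""})
h2 = job.find("h2")
print(h2.get_text(strip=True))                        # Senior Software Engineer
print("https://www.bayt.com" + h2.find("a")["href"])  # the joined job_url
print(job.find("div", class_="t-nowrap p10l").span.get_text(strip=True))  # Acme Corp
print(job.find("div", class_="t-mute t-small").get_text(strip=True))      # Dubai
```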


@@ -24,3 +24,13 @@ class ZipRecruiterException(Exception):
 class GlassdoorException(Exception):
     def __init__(self, message=None):
         super().__init__(message or "An error occurred with Glassdoor")
+
+
+class GoogleJobsException(Exception):
+    def __init__(self, message=None):
+        super().__init__(message or "An error occurred with Google Jobs")
+
+
+class BaytException(Exception):
+    def __init__(self, message=None):
+        super().__init__(message or "An error occurred with Bayt")


@@ -14,13 +14,13 @@ from typing import Optional, Tuple
 from datetime import datetime, timedelta
 from concurrent.futures import ThreadPoolExecutor, as_completed
 
+from .constants import fallback_token, query_template, headers
 from .. import Scraper, ScraperInput, Site
-from ..utils import extract_emails_from_text
+from ..utils import extract_emails_from_text, create_logger
 from ..exceptions import GlassdoorException
 from ..utils import (
     create_session,
     markdown_converter,
-    logger,
 )
 from ...jobs import (
     JobPost,
@@ -32,14 +32,18 @@ from ...jobs import (
     DescriptionFormat,
 )
 
+log = create_logger("Glassdoor")
+
 
 class GlassdoorScraper(Scraper):
-    def __init__(self, proxy: Optional[str] = None):
+    def __init__(
+        self, proxies: list[str] | str | None = None, ca_cert: str | None = None
+    ):
         """
         Initializes GlassdoorScraper with the Glassdoor job search url
         """
         site = Site(Site.GLASSDOOR)
-        super().__init__(site, proxy=proxy)
+        super().__init__(site, proxies=proxies, ca_cert=ca_cert)
 
         self.base_url = None
         self.country = None
@@ -59,36 +63,39 @@ class GlassdoorScraper(Scraper):
         self.scraper_input.results_wanted = min(900, scraper_input.results_wanted)
         self.base_url = self.scraper_input.country.get_glassdoor_url()
 
-        self.session = create_session(self.proxy, is_tls=True, has_retry=True)
+        self.session = create_session(
+            proxies=self.proxies, ca_cert=self.ca_cert, has_retry=True
+        )
         token = self._get_csrf_token()
-        self.headers["gd-csrf-token"] = token if token else self.fallback_token
+        headers["gd-csrf-token"] = token if token else fallback_token
+        self.session.headers.update(headers)
 
         location_id, location_type = self._get_location(
             scraper_input.location, scraper_input.is_remote
         )
         if location_type is None:
-            logger.error("Glassdoor: location not parsed")
+            log.error("Glassdoor: location not parsed")
             return JobResponse(jobs=[])
-        all_jobs: list[JobPost] = []
+        job_list: list[JobPost] = []
         cursor = None
 
         range_start = 1 + (scraper_input.offset // self.jobs_per_page)
         tot_pages = (scraper_input.results_wanted // self.jobs_per_page) + 2
         range_end = min(tot_pages, self.max_pages + 1)
         for page in range(range_start, range_end):
-            logger.info(f"Glassdoor search page: {page}")
+            log.info(f"search page: {page} / {range_end - 1}")
             try:
                 jobs, cursor = self._fetch_jobs_page(
                     scraper_input, location_id, location_type, page, cursor
                 )
-                all_jobs.extend(jobs)
-                if not jobs or len(all_jobs) >= scraper_input.results_wanted:
-                    all_jobs = all_jobs[: scraper_input.results_wanted]
+                job_list.extend(jobs)
+                if not jobs or len(job_list) >= scraper_input.results_wanted:
+                    job_list = job_list[: scraper_input.results_wanted]
                     break
             except Exception as e:
-                logger.error(f"Glassdoor: {str(e)}")
+                log.error(f"Glassdoor: {str(e)}")
                 break
-        return JobResponse(jobs=all_jobs)
+        return JobResponse(jobs=job_list)
 
     def _fetch_jobs_page(
         self,
@@ -107,7 +114,6 @@ class GlassdoorScraper(Scraper):
             payload = self._add_payload(location_id, location_type, page_num, cursor)
             response = self.session.post(
                 f"{self.base_url}/graph",
-                headers=self.headers,
                 timeout_seconds=15,
                 data=payload,
             )
@@ -123,7 +129,7 @@ class GlassdoorScraper(Scraper):
             ValueError,
             Exception,
         ) as e:
-            logger.error(f"Glassdoor: {str(e)}")
+            log.error(f"Glassdoor: {str(e)}")
             return jobs, None
 
         jobs_data = res_json["data"]["jobListings"]["jobListings"]
@@ -148,9 +154,7 @@ class GlassdoorScraper(Scraper):
         """
        Fetches csrf token needed for API by visiting a generic page
        """
-        res = self.session.get(
-            f"{self.base_url}/Job/computer-science-jobs.htm", headers=self.headers
-        )
+        res = self.session.get(f"{self.base_url}/Job/computer-science-jobs.htm")
         pattern = r'"token":\s*"([^"]+)"'
         matches = re.findall(pattern, res.text)
         token = None
@@ -189,7 +193,17 @@ class GlassdoorScraper(Scraper):
         except:
             description = None
         company_url = f"{self.base_url}Overview/W-EI_IE{company_id}.htm"
+        company_logo = (
+            job_data["jobview"].get("overview", {}).get("squareLogoUrl", None)
+        )
+        listing_type = (
+            job_data["jobview"]
+            .get("header", {})
+            .get("adOrderSponsorshipLevel", "")
+            .lower()
+        )
         return JobPost(
+            id=f"gd-{job_id}",
             title=title,
             company_url=company_url if company_id else None,
             company_name=company_name,
@@ -200,6 +214,8 @@ class GlassdoorScraper(Scraper):
             is_remote=is_remote,
             description=description,
             emails=extract_emails_from_text(description) if description else None,
+            company_logo=company_logo,
+            listing_type=listing_type,
         )
 
     def _fetch_job_description(self, job_id):
@@ -231,7 +247,7 @@ class GlassdoorScraper(Scraper):
             """,
             }
         ]
-        res = requests.post(url, json=body, headers=self.headers)
+        res = requests.post(url, json=body, headers=headers)
         if res.status_code != 200:
             return None
         data = res.json()[0]
@@ -244,17 +260,16 @@ class GlassdoorScraper(Scraper):
         if not location or is_remote:
             return "11047", "STATE"  # remote options
         url = f"{self.base_url}/findPopularLocationAjax.htm?maxLocationsToReturn=10&term={location}"
-        session = create_session(self.proxy, has_retry=True)
-        res = self.session.get(url, headers=self.headers)
+        res = self.session.get(url)
         if res.status_code != 200:
             if res.status_code == 429:
                 err = f"429 Response - Blocked by Glassdoor for too many requests"
-                logger.error(err)
+                log.error(err)
                 return None, None
             else:
                 err = f"Glassdoor response status code {res.status_code}"
                 err += f" - {res.text}"
-                logger.error(f"Glassdoor response status code {res.status_code}")
+                log.error(f"Glassdoor response status code {res.status_code}")
             return None, None
         items = res.json()
@@ -299,7 +314,7 @@ class GlassdoorScraper(Scraper):
                 "fromage": fromage,
                 "sort": "date",
             },
-            "query": self.query_template,
+            "query": query_template,
         }
         if self.scraper_input.job_type:
             payload["variables"]["filterParams"].append(
@@ -347,188 +362,3 @@
         for cursor_data in pagination_cursors:
             if cursor_data["pageNumber"] == page_num:
                 return cursor_data["cursor"]
fallback_token = "Ft6oHEWlRZrxDww95Cpazw:0pGUrkb2y3TyOpAIqF2vbPmUXoXVkD3oEGDVkvfeCerceQ5-n8mBg3BovySUIjmCPHCaW0H2nQVdqzbtsYqf4Q:wcqRqeegRUa9MVLJGyujVXB7vWFPjdaS1CtrrzJq-ok"
headers = {
"authority": "www.glassdoor.com",
"accept": "*/*",
"accept-language": "en-US,en;q=0.9",
"apollographql-client-name": "job-search-next",
"apollographql-client-version": "4.65.5",
"content-type": "application/json",
"origin": "https://www.glassdoor.com",
"referer": "https://www.glassdoor.com/",
"sec-ch-ua": '"Chromium";v="118", "Google Chrome";v="118", "Not=A?Brand";v="99"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"macOS"',
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
}
query_template = """
query JobSearchResultsQuery(
$excludeJobListingIds: [Long!],
$keyword: String,
$locationId: Int,
$locationType: LocationTypeEnum,
$numJobsToShow: Int!,
$pageCursor: String,
$pageNumber: Int,
$filterParams: [FilterParams],
$originalPageUrl: String,
$seoFriendlyUrlInput: String,
$parameterUrlInput: String,
$seoUrl: Boolean
) {
jobListings(
contextHolder: {
searchParams: {
excludeJobListingIds: $excludeJobListingIds,
keyword: $keyword,
locationId: $locationId,
locationType: $locationType,
numPerPage: $numJobsToShow,
pageCursor: $pageCursor,
pageNumber: $pageNumber,
filterParams: $filterParams,
originalPageUrl: $originalPageUrl,
seoFriendlyUrlInput: $seoFriendlyUrlInput,
parameterUrlInput: $parameterUrlInput,
seoUrl: $seoUrl,
searchType: SR
}
}
) {
companyFilterOptions {
id
shortName
__typename
}
filterOptions
indeedCtk
jobListings {
...JobView
__typename
}
jobListingSeoLinks {
linkItems {
position
url
__typename
}
__typename
}
jobSearchTrackingKey
jobsPageSeoData {
pageMetaDescription
pageTitle
__typename
}
paginationCursors {
cursor
pageNumber
__typename
}
indexablePageForSeo
searchResultsMetadata {
searchCriteria {
implicitLocation {
id
localizedDisplayName
type
__typename
}
keyword
location {
id
shortName
localizedShortName
localizedDisplayName
type
__typename
}
__typename
}
helpCenterDomain
helpCenterLocale
jobSerpJobOutlook {
occupation
paragraph
__typename
}
showMachineReadableJobs
__typename
}
totalJobsCount
__typename
}
}
fragment JobView on JobListingSearchResult {
jobview {
header {
adOrderId
advertiserType
adOrderSponsorshipLevel
ageInDays
divisionEmployerName
easyApply
employer {
id
name
shortName
__typename
}
employerNameFromSearch
goc
gocConfidence
gocId
jobCountryId
jobLink
jobResultTrackingKey
jobTitleText
locationName
locationType
locId
needsCommission
payCurrency
payPeriod
payPeriodAdjustedPay {
p10
p50
p90
__typename
}
rating
salarySource
savedJobId
sponsored
__typename
}
job {
description
importConfigId
jobTitleId
jobTitleText
listingId
__typename
}
jobListingAdminDetails {
cpcVal
importConfigId
jobListingId
jobSourceId
userEligibleForAdminJobDetails
__typename
}
overview {
shortName
squareLogoUrl
__typename
}
__typename
}
__typename
}
"""


@@ -0,0 +1,184 @@
headers = {
"authority": "www.glassdoor.com",
"accept": "*/*",
"accept-language": "en-US,en;q=0.9",
"apollographql-client-name": "job-search-next",
"apollographql-client-version": "4.65.5",
"content-type": "application/json",
"origin": "https://www.glassdoor.com",
"referer": "https://www.glassdoor.com/",
"sec-ch-ua": '"Chromium";v="118", "Google Chrome";v="118", "Not=A?Brand";v="99"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"macOS"',
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
}
query_template = """
query JobSearchResultsQuery(
$excludeJobListingIds: [Long!],
$keyword: String,
$locationId: Int,
$locationType: LocationTypeEnum,
$numJobsToShow: Int!,
$pageCursor: String,
$pageNumber: Int,
$filterParams: [FilterParams],
$originalPageUrl: String,
$seoFriendlyUrlInput: String,
$parameterUrlInput: String,
$seoUrl: Boolean
) {
jobListings(
contextHolder: {
searchParams: {
excludeJobListingIds: $excludeJobListingIds,
keyword: $keyword,
locationId: $locationId,
locationType: $locationType,
numPerPage: $numJobsToShow,
pageCursor: $pageCursor,
pageNumber: $pageNumber,
filterParams: $filterParams,
originalPageUrl: $originalPageUrl,
seoFriendlyUrlInput: $seoFriendlyUrlInput,
parameterUrlInput: $parameterUrlInput,
seoUrl: $seoUrl,
searchType: SR
}
}
) {
companyFilterOptions {
id
shortName
__typename
}
filterOptions
indeedCtk
jobListings {
...JobView
__typename
}
jobListingSeoLinks {
linkItems {
position
url
__typename
}
__typename
}
jobSearchTrackingKey
jobsPageSeoData {
pageMetaDescription
pageTitle
__typename
}
paginationCursors {
cursor
pageNumber
__typename
}
indexablePageForSeo
searchResultsMetadata {
searchCriteria {
implicitLocation {
id
localizedDisplayName
type
__typename
}
keyword
location {
id
shortName
localizedShortName
localizedDisplayName
type
__typename
}
__typename
}
helpCenterDomain
helpCenterLocale
jobSerpJobOutlook {
occupation
paragraph
__typename
}
showMachineReadableJobs
__typename
}
totalJobsCount
__typename
}
}
fragment JobView on JobListingSearchResult {
jobview {
header {
adOrderId
advertiserType
adOrderSponsorshipLevel
ageInDays
divisionEmployerName
easyApply
employer {
id
name
shortName
__typename
}
employerNameFromSearch
goc
gocConfidence
gocId
jobCountryId
jobLink
jobResultTrackingKey
jobTitleText
locationName
locationType
locId
needsCommission
payCurrency
payPeriod
payPeriodAdjustedPay {
p10
p50
p90
__typename
}
rating
salarySource
savedJobId
sponsored
__typename
}
job {
description
importConfigId
jobTitleId
jobTitleText
listingId
__typename
}
jobListingAdminDetails {
cpcVal
importConfigId
jobListingId
jobSourceId
userEligibleForAdminJobDetails
__typename
}
overview {
shortName
squareLogoUrl
__typename
}
__typename
}
__typename
}
"""
fallback_token = "Ft6oHEWlRZrxDww95Cpazw:0pGUrkb2y3TyOpAIqF2vbPmUXoXVkD3oEGDVkvfeCerceQ5-n8mBg3BovySUIjmCPHCaW0H2nQVdqzbtsYqf4Q:wcqRqeegRUa9MVLJGyujVXB7vWFPjdaS1CtrrzJq-ok"


@@ -0,0 +1,247 @@
"""
jobspy.scrapers.google
~~~~~~~~~~~~~~~~~~~
This module contains routines to scrape Google.
"""
from __future__ import annotations
import math
import re
import json
from typing import Tuple
from datetime import datetime, timedelta
from .constants import headers_jobs, headers_initial, async_param
from .. import Scraper, ScraperInput, Site
from ..utils import (
extract_emails_from_text,
create_logger,
extract_job_type,
create_session,
)
from ...jobs import (
JobPost,
JobResponse,
Location,
JobType,
)
log = create_logger("Google")
class GoogleJobsScraper(Scraper):
def __init__(
self, proxies: list[str] | str | None = None, ca_cert: str | None = None
):
"""
Initializes Google Scraper with the Google jobs search url
"""
site = Site(Site.GOOGLE)
super().__init__(site, proxies=proxies, ca_cert=ca_cert)
self.country = None
self.session = None
self.scraper_input = None
self.jobs_per_page = 10
self.seen_urls = set()
self.url = "https://www.google.com/search"
self.jobs_url = "https://www.google.com/async/callback:550"
def scrape(self, scraper_input: ScraperInput) -> JobResponse:
"""
Scrapes Google for jobs with scraper_input criteria.
:param scraper_input: Information about job search criteria.
:return: JobResponse containing a list of jobs.
"""
self.scraper_input = scraper_input
self.scraper_input.results_wanted = min(900, scraper_input.results_wanted)
self.session = create_session(
proxies=self.proxies, ca_cert=self.ca_cert, is_tls=False, has_retry=True
)
forward_cursor, job_list = self._get_initial_cursor_and_jobs()
if forward_cursor is None:
log.warning(
"initial cursor not found, try changing your query or there was at most 10 results"
)
return JobResponse(jobs=job_list)
page = 1
while (
len(self.seen_urls) < scraper_input.results_wanted + scraper_input.offset
and forward_cursor
):
log.info(
f"search page: {page} / {math.ceil(scraper_input.results_wanted / self.jobs_per_page)}"
)
try:
jobs, forward_cursor = self._get_jobs_next_page(forward_cursor)
except Exception as e:
log.error(f"failed to get jobs on page: {page}, {e}")
break
if not jobs:
log.info(f"found no jobs on page: {page}")
break
job_list += jobs
page += 1
return JobResponse(
jobs=job_list[
scraper_input.offset : scraper_input.offset
+ scraper_input.results_wanted
]
)
def _get_initial_cursor_and_jobs(self) -> Tuple[str, list[JobPost]]:
"""Gets initial cursor and jobs to paginate through job listings"""
query = f"{self.scraper_input.search_term} jobs"
def get_time_range(hours_old):
if hours_old <= 24:
return "since yesterday"
elif hours_old <= 72:
return "in the last 3 days"
elif hours_old <= 168:
return "in the last week"
else:
return "in the last month"
job_type_mapping = {
JobType.FULL_TIME: "Full time",
JobType.PART_TIME: "Part time",
JobType.INTERNSHIP: "Internship",
JobType.CONTRACT: "Contract",
}
if self.scraper_input.job_type in job_type_mapping:
query += f" {job_type_mapping[self.scraper_input.job_type]}"
if self.scraper_input.location:
query += f" near {self.scraper_input.location}"
if self.scraper_input.hours_old:
time_filter = get_time_range(self.scraper_input.hours_old)
query += f" {time_filter}"
if self.scraper_input.is_remote:
query += " remote"
if self.scraper_input.google_search_term:
query = self.scraper_input.google_search_term
params = {"q": query, "udm": "8"}
response = self.session.get(self.url, headers=headers_initial, params=params)
pattern_fc = r'<div jsname="Yust4d"[^>]+data-async-fc="([^"]+)"'
match_fc = re.search(pattern_fc, response.text)
data_async_fc = match_fc.group(1) if match_fc else None
jobs_raw = self._find_job_info_initial_page(response.text)
jobs = []
for job_raw in jobs_raw:
job_post = self._parse_job(job_raw)
if job_post:
jobs.append(job_post)
return data_async_fc, jobs
def _get_jobs_next_page(self, forward_cursor: str) -> Tuple[list[JobPost], str]:
params = {"fc": [forward_cursor], "fcv": ["3"], "async": [async_param]}
response = self.session.get(self.jobs_url, headers=headers_jobs, params=params)
return self._parse_jobs(response.text)
def _parse_jobs(self, job_data: str) -> Tuple[list[JobPost], str]:
"""
Parses jobs on a page with next page cursor
"""
start_idx = job_data.find("[[[")
end_idx = job_data.rindex("]]]") + 3
s = job_data[start_idx:end_idx]
parsed = json.loads(s)[0]
pattern_fc = r'data-async-fc="([^"]+)"'
match_fc = re.search(pattern_fc, job_data)
data_async_fc = match_fc.group(1) if match_fc else None
jobs_on_page = []
for array in parsed:
_, job_data = array
if not job_data.startswith("[[["):
continue
job_d = json.loads(job_data)
job_info = self._find_job_info(job_d)
job_post = self._parse_job(job_info)
if job_post:
jobs_on_page.append(job_post)
return jobs_on_page, data_async_fc
def _parse_job(self, job_info: list):
job_url = job_info[3][0][0] if job_info[3] and job_info[3][0] else None
if job_url in self.seen_urls:
return
self.seen_urls.add(job_url)
title = job_info[0]
company_name = job_info[1]
location = city = job_info[2]
state = country = date_posted = None
if location and "," in location:
city, state, *country = [*map(lambda x: x.strip(), location.split(","))]
days_ago_str = job_info[12]
if isinstance(days_ago_str, str):
match = re.search(r"\d+", days_ago_str)
days_ago = int(match.group()) if match else None
if days_ago is not None:
date_posted = (datetime.now() - timedelta(days=days_ago)).date()
description = job_info[19]
job_post = JobPost(
id=f"go-{job_info[28]}",
title=title,
company_name=company_name,
location=Location(
city=city, state=state, country=country[0] if country else None
),
job_url=job_url,
date_posted=date_posted,
is_remote="remote" in description.lower() or "wfh" in description.lower(),
description=description,
emails=extract_emails_from_text(description),
job_type=extract_job_type(description),
)
return job_post
@staticmethod
def _find_job_info(jobs_data: list | dict) -> list | None:
"""Iterates through the JSON data to find the job listings"""
if isinstance(jobs_data, dict):
for key, value in jobs_data.items():
if key == "520084652" and isinstance(value, list):
return value
else:
result = GoogleJobsScraper._find_job_info(value)
if result:
return result
elif isinstance(jobs_data, list):
for item in jobs_data:
result = GoogleJobsScraper._find_job_info(item)
if result:
return result
return None
@staticmethod
def _find_job_info_initial_page(html_text: str):
pattern = r'520084652":(\[.*?\]\s*])\s*}\s*]\s*]\s*]\s*]\s*]'
results = []
matches = re.finditer(pattern, html_text)
for match in matches:
try:
parsed_data = json.loads(match.group(1))
results.append(parsed_data)
except json.JSONDecodeError as e:
log.error(f"Failed to parse match: {str(e)}")
results.append({"raw_match": match.group(0), "error": str(e)})
return results
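
The recursive lookup in _find_job_info can be exercised standalone; a sketch with a made-up nested payload (the key "520084652" is taken from the code above, everything else is invented):

def find_job_info(data):
    """Walk nested dicts/lists until a list stored under "520084652" is found."""
    if isinstance(data, dict):
        for key, value in data.items():
            if key == "520084652" and isinstance(value, list):
                return value
            result = find_job_info(value)
            if result:
                return result
    elif isinstance(data, list):
        for item in data:
            result = find_job_info(item)
            if result:
                return result
    return None

nested = [{"wrapper": [1, {"520084652": ["Engineer", "Acme", "Austin, TX"]}]}]
print(find_job_info(nested))  # ['Engineer', 'Acme', 'Austin, TX']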


@@ -0,0 +1,52 @@
headers_initial = {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"accept-language": "en-US,en;q=0.9",
"priority": "u=0, i",
"referer": "https://www.google.com/",
"sec-ch-prefers-color-scheme": "dark",
"sec-ch-ua": '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
"sec-ch-ua-arch": '"arm"',
"sec-ch-ua-bitness": '"64"',
"sec-ch-ua-form-factors": '"Desktop"',
"sec-ch-ua-full-version": '"130.0.6723.58"',
"sec-ch-ua-full-version-list": '"Chromium";v="130.0.6723.58", "Google Chrome";v="130.0.6723.58", "Not?A_Brand";v="99.0.0.0"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-model": '""',
"sec-ch-ua-platform": '"macOS"',
"sec-ch-ua-platform-version": '"15.0.1"',
"sec-ch-ua-wow64": "?0",
"sec-fetch-dest": "document",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "same-origin",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
"x-browser-channel": "stable",
"x-browser-copyright": "Copyright 2024 Google LLC. All rights reserved.",
"x-browser-year": "2024",
}
headers_jobs = {
"accept": "*/*",
"accept-language": "en-US,en;q=0.9",
"priority": "u=1, i",
"referer": "https://www.google.com/",
"sec-ch-prefers-color-scheme": "dark",
"sec-ch-ua": '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
"sec-ch-ua-arch": '"arm"',
"sec-ch-ua-bitness": '"64"',
"sec-ch-ua-form-factors": '"Desktop"',
"sec-ch-ua-full-version": '"130.0.6723.58"',
"sec-ch-ua-full-version-list": '"Chromium";v="130.0.6723.58", "Google Chrome";v="130.0.6723.58", "Not?A_Brand";v="99.0.0.0"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-model": '""',
"sec-ch-ua-platform": '"macOS"',
"sec-ch-ua-platform-version": '"15.0.1"',
"sec-ch-ua-wow64": "?0",
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
}
async_param = "_basejs:/xjs/_/js/k=xjs.s.en_US.JwveA-JiKmg.2018.O/am=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAACAAAoICAAAAAAAKMAfAAAAIAQAAAAAAAAAAAAACCAAAEJDAAACAAAAAGABAIAAARBAAABAAAAAgAgQAABAASKAfv8JAAABAAAAAAwAQAQACQAAAAAAcAEAQABoCAAAABAAAIABAACAAAAEAAAAFAAAAAAAAAAAAAAAAAAAAAAAAACAQADoBwAAAAAAAAAAAAAQBAAAAATQAAoACOAHAAAAAAAAAQAAAIIAAAA_ZAACAAAAAAAAcB8APB4wHFJ4AAAAAAAAAAAAAAAACECCYA5If0EACAAAAAAAAAAAAAAAAAAAUgRNXG4AMAE/dg=0/br=1/rs=ACT90oGxMeaFMCopIHq5tuQM-6_3M_VMjQ,_basecss:/xjs/_/ss/k=xjs.s.IwsGu62EDtU.L.B1.O/am=QOoQIAQAAAQAREADEBAAAAAAAAAAAAAAAAAAAAAgAQAAIAAAgAQAAAIAIAIAoEwCAADIC8AfsgEAawwAPkAAjgoAGAAAAAAAAEADAAAAAAIgAECHAAAAAAAAAAABAQAggAARQAAAQCEAAAAAIAAAABgAAAAAIAQIACCAAfB-AAFIQABoCEA_CgEAAIABAACEgHAEwwAEFQAM4CgAAAAAAAAAAAAACABCAAAAQEAAABAgAMCPAAA4AoE2BAEAggSAAIoAQAAAAAgAAAAACCAQAAAxEwA_ZAACAAAAAAAAAAkAAAAAAAAgAAAAAAAAAAAAAAAAAAAAAAAAQAEAAAAAAAAAAAAAAAAAAAAAQA/br=1/rs=ACT90oGZc36t3uUQkj0srnIvvbHjO2hgyg,_basecomb:/xjs/_/js/k=xjs.s.en_US.JwveA-JiKmg.2018.O/ck=xjs.s.IwsGu62EDtU.L.B1.O/am=QOoQIAQAAAQAREADEBAAAAAAAAAAAAAAAAAAAAAgAQAAIAAAgAQAAAKAIAoIqEwCAADIK8AfsgEAawwAPkAAjgoAGAAACCAAAEJDAAACAAIgAGCHAIAAARBAAABBAQAggAgRQABAQSOAfv8JIAABABgAAAwAYAQICSCAAfB-cAFIQABoCEA_ChEAAIABAACEgHAEwwAEFQAM4CgAAAAAAAAAAAAACABCAACAQEDoBxAgAMCPAAA4AoE2BAEAggTQAIoASOAHAAgAAAAACSAQAIIxEwA_ZAACAAAAAAAAcB8APB4wHFJ4AAAAAAAAAAAAAAAACECCYA5If0EACAAAAAAAAAAAAAAAAAAAUgRNXG4AMAE/d=1/ed=1/dg=0/br=1/ujg=1/rs=ACT90oFNLTjPzD_OAqhhtXwe2pg1T3WpBg,_fmt:prog,_id:fc_5FwaZ86OKsfdwN4P4La3yA4_2"


@@ -10,16 +10,15 @@ from __future__ import annotations
import math
from typing import Tuple
from datetime import datetime
- from concurrent.futures import ThreadPoolExecutor, Future
- import requests
+ from .constants import job_search_query, api_headers
from .. import Scraper, ScraperInput, Site
from ..utils import (
extract_emails_from_text,
get_enum_from_job_type,
markdown_converter,
- logger,
+ create_session,
+ create_logger,
)
from ...jobs import (
JobPost,
@@ -31,12 +30,21 @@ from ...jobs import (
DescriptionFormat,
)
+ log = create_logger("Indeed")
class IndeedScraper(Scraper):
- def __init__(self, proxy: str | None = None):
+ def __init__(
+ self, proxies: list[str] | str | None = None, ca_cert: str | None = None
+ ):
"""
Initializes IndeedScraper with the Indeed API url
"""
+ super().__init__(Site.INDEED, proxies=proxies)
+ self.session = create_session(
+ proxies=self.proxies, ca_cert=ca_cert, is_tls=False
+ )
self.scraper_input = None
self.jobs_per_page = 100
self.num_workers = 10
@@ -45,8 +53,6 @@ class IndeedScraper(Scraper):
self.api_country_code = None
self.base_url = None
self.api_url = "https://apis.indeed.com/graphql"
- site = Site(Site.INDEED)
- super().__init__(site, proxy=proxy)
def scrape(self, scraper_input: ScraperInput) -> JobResponse:
"""
@@ -57,29 +63,29 @@ class IndeedScraper(Scraper):
self.scraper_input = scraper_input
domain, self.api_country_code = self.scraper_input.country.indeed_domain_value
self.base_url = f"https://{domain}.indeed.com"
- self.headers = self.api_headers.copy()
+ self.headers = api_headers.copy()
self.headers["indeed-co"] = self.scraper_input.country.indeed_domain_value
job_list = []
page = 1
cursor = None
- offset_pages = math.ceil(self.scraper_input.offset / 100)
- for _ in range(offset_pages):
- logger.info(f"Indeed skipping search page: {page}")
- __, cursor = self._scrape_page(cursor)
- if not __:
- logger.info(f"Indeed found no jobs on page: {page}")
- break
- while len(self.seen_urls) < scraper_input.results_wanted:
- logger.info(f"Indeed search page: {page}")
+ while len(self.seen_urls) < scraper_input.results_wanted + scraper_input.offset:
+ log.info(
+ f"search page: {page} / {math.ceil(scraper_input.results_wanted / self.jobs_per_page)}"
+ )
jobs, cursor = self._scrape_page(cursor)
if not jobs:
- logger.info(f"Indeed found no jobs on page: {page}")
+ log.info(f"found no jobs on page: {page}")
break
job_list += jobs
page += 1
- return JobResponse(jobs=job_list[: scraper_input.results_wanted])
+ return JobResponse(
+ jobs=job_list[
+ scraper_input.offset : scraper_input.offset
+ + scraper_input.results_wanted
+ ]
+ )
def _scrape_page(self, cursor: str | None) -> Tuple[list[JobPost], str | None]:
"""
@@ -90,18 +96,18 @@ class IndeedScraper(Scraper):
jobs = []
new_cursor = None
filters = self._build_filters()
- location = (
- self.scraper_input.location
- or self.scraper_input.country.value[0].split(",")[-1]
+ search_term = (
+ self.scraper_input.search_term.replace('"', '\\"')
+ if self.scraper_input.search_term
+ else ""
)
- query = self.job_search_query.format(
- what=(
- f'what: "{self.scraper_input.search_term}"'
- if self.scraper_input.search_term
+ query = job_search_query.format(
+ what=(f'what: "{search_term}"' if search_term else ""),
+ location=(
+ f'location: {{where: "{self.scraper_input.location}", radius: {self.scraper_input.distance}, radiusUnit: MILES}}'
+ if self.scraper_input.location
else ""
),
- location=location,
- radius=self.scraper_input.distance,
dateOnIndeed=self.scraper_input.hours_old,
cursor=f'cursor: "{cursor}"' if cursor else "",
filters=filters,
@@ -109,29 +115,30 @@ class IndeedScraper(Scraper):
payload = {
"query": query,
}
- api_headers = self.api_headers.copy()
- api_headers["indeed-co"] = self.api_country_code
- response = requests.post(
+ api_headers_temp = api_headers.copy()
+ api_headers_temp["indeed-co"] = self.api_country_code
+ response = self.session.post(
self.api_url,
- headers=api_headers,
+ headers=api_headers_temp,
json=payload,
- proxies=self.proxy,
timeout=10,
- verify=False,
)
- if response.status_code != 200:
- logger.info(
- f"Indeed responded with status code: {response.status_code} (submit GitHub issue if this appears to be a beg)"
+ if not response.ok:
+ log.info(
+ f"responded with status code: {response.status_code} (submit GitHub issue if this appears to be a bug)"
)
return jobs, new_cursor
data = response.json()
jobs = data["data"]["jobSearch"]["results"]
new_cursor = data["data"]["jobSearch"]["pageInfo"]["nextCursor"]
- with ThreadPoolExecutor(max_workers=self.num_workers) as executor:
- job_results: list[Future] = [
- executor.submit(self._process_job, job["job"]) for job in jobs
- ]
- job_list = [result.result() for result in job_results if result.result()]
+ job_list = []
+ for job in jobs:
+ processed_job = self._process_job(job["job"])
+ if processed_job:
+ job_list.append(processed_job)
return job_list, new_cursor
def _build_filters(self):
@@ -151,6 +158,15 @@ class IndeedScraper(Scraper):
""".format(
start=self.scraper_input.hours_old
)
+ elif self.scraper_input.easy_apply:
+ filters_str = """
+ filters: {
+ keyword: {
+ field: "indeedApplyScope",
+ keys: ["DESKTOP"]
+ }
+ }
+ """
elif self.scraper_input.job_type or self.scraper_input.is_remote:
job_type_key_mapping = {
JobType.FULL_TIME: "CF3CP",
@@ -168,7 +184,7 @@ class IndeedScraper(Scraper):
keys.append("DSQF7")
if keys:
- keys_str = '", "'.join(keys)  # Prepare your keys string
+ keys_str = '", "'.join(keys)
filters_str = f"""
filters: {{
composite: {{
@@ -204,6 +220,7 @@ class IndeedScraper(Scraper):
employer_details = employer.get("employerDetails", {}) if employer else {}
rel_url = job["employer"]["relativeCompanyPageUrl"] if job["employer"] else None
return JobPost(
+ id=f'in-{job["key"]}',
title=job["title"],
description=description,
company_name=job["employer"].get("name") if job.get("employer") else None,
@@ -217,7 +234,7 @@ class IndeedScraper(Scraper):
country=job.get("location", {}).get("countryCode"),
),
job_type=job_type,
- compensation=self._get_compensation(job),
+ compensation=self._get_compensation(job["compensation"]),
date_posted=date_posted,
job_url=job_url,
job_url_direct=(
@@ -235,24 +252,18 @@ class IndeedScraper(Scraper):
.replace("Iv1", "")
.replace("_", " ")
.title()
+ .strip()
if employer_details.get("industry")
else None
),
company_num_employees=employer_details.get("employeesLocalizedLabel"),
company_revenue=employer_details.get("revenueLocalizedLabel"),
company_description=employer_details.get("briefDescription"),
- ceo_name=employer_details.get("ceoName"),
- ceo_photo_url=employer_details.get("ceoPhotoUrl"),
- logo_photo_url=(
+ company_logo=(
employer["images"].get("squareLogoUrl")
if employer and employer.get("images")
else None
),
- banner_photo_url=(
- employer["images"].get("headerImageUrl")
- if employer and employer.get("images")
- else None
- ),
)
@staticmethod
@@ -271,14 +282,19 @@ class IndeedScraper(Scraper):
return job_types
@staticmethod
- def _get_compensation(job: dict) -> Compensation | None:
+ def _get_compensation(compensation: dict) -> Compensation | None:
"""
Parses the job to get compensation
:param job:
- :param job:
:return: compensation object
"""
- comp = job["compensation"]["baseSalary"]
+ if not compensation["baseSalary"] and not compensation["estimated"]:
+ return None
+ comp = (
+ compensation["baseSalary"]
+ if compensation["baseSalary"]
+ else compensation["estimated"]["baseSalary"]
+ )
if not comp:
return None
interval = IndeedScraper._get_compensation_interval(comp["unitOfWork"])
@@ -288,9 +304,13 @@ class IndeedScraper(Scraper):
max_range = comp["range"].get("max")
return Compensation(
interval=interval,
- min_amount=round(min_range, 2) if min_range is not None else None,
- max_amount=round(max_range, 2) if max_range is not None else None,
- currency=job["compensation"]["currencyCode"],
+ min_amount=int(min_range) if min_range is not None else None,
+ max_amount=int(max_range) if max_range is not None else None,
+ currency=(
+ compensation["estimated"]["currencyCode"]
+ if compensation["estimated"]
+ else compensation["currencyCode"]
+ ),
)
@staticmethod
@@ -328,98 +348,3 @@ class IndeedScraper(Scraper):
return CompensationInterval[mapped_interval]
else:
raise ValueError(f"Unsupported interval: {interval}")
api_headers = {
"Host": "apis.indeed.com",
"content-type": "application/json",
"indeed-api-key": "161092c2017b5bbab13edb12461a62d5a833871e7cad6d9d475304573de67ac8",
"accept": "application/json",
"indeed-locale": "en-US",
"accept-language": "en-US,en;q=0.9",
"user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_6_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 Indeed App 193.1",
"indeed-app-info": "appv=193.1; appid=com.indeed.jobsearch; osv=16.6.1; os=ios; dtype=phone",
}
job_search_query = """
query GetJobData {{
jobSearch(
{what}
location: {{ where: "{location}", radius: {radius}, radiusUnit: MILES }}
includeSponsoredResults: NONE
limit: 100
sort: DATE
{cursor}
{filters}
) {{
pageInfo {{
nextCursor
}}
results {{
trackingKey
job {{
key
title
datePublished
dateOnIndeed
description {{
html
}}
location {{
countryName
countryCode
admin1Code
city
postalCode
streetAddress
formatted {{
short
long
}}
}}
compensation {{
baseSalary {{
unitOfWork
range {{
... on Range {{
min
max
}}
}}
}}
currencyCode
}}
attributes {{
key
label
}}
employer {{
relativeCompanyPageUrl
name
dossier {{
employerDetails {{
addresses
industry
employeesLocalizedLabel
revenueLocalizedLabel
briefDescription
ceoName
ceoPhotoUrl
}}
images {{
headerImageUrl
squareLogoUrl
}}
links {{
corporateWebsite
}}
}}
}}
recruit {{
viewJobUrl
detailedSalary
workSchedule
}}
}}
}}
}}
}}
"""


@@ -0,0 +1,109 @@
job_search_query = """
query GetJobData {{
jobSearch(
{what}
{location}
limit: 100
{cursor}
sort: RELEVANCE
{filters}
) {{
pageInfo {{
nextCursor
}}
results {{
trackingKey
job {{
source {{
name
}}
key
title
datePublished
dateOnIndeed
description {{
html
}}
location {{
countryName
countryCode
admin1Code
city
postalCode
streetAddress
formatted {{
short
long
}}
}}
compensation {{
estimated {{
currencyCode
baseSalary {{
unitOfWork
range {{
... on Range {{
min
max
}}
}}
}}
}}
baseSalary {{
unitOfWork
range {{
... on Range {{
min
max
}}
}}
}}
currencyCode
}}
attributes {{
key
label
}}
employer {{
relativeCompanyPageUrl
name
dossier {{
employerDetails {{
addresses
industry
employeesLocalizedLabel
revenueLocalizedLabel
briefDescription
ceoName
ceoPhotoUrl
}}
images {{
headerImageUrl
squareLogoUrl
}}
links {{
corporateWebsite
}}
}}
}}
recruit {{
viewJobUrl
detailedSalary
workSchedule
}}
}}
}}
}}
}}
"""
api_headers = {
"Host": "apis.indeed.com",
"content-type": "application/json",
"indeed-api-key": "161092c2017b5bbab13edb12461a62d5a833871e7cad6d9d475304573de67ac8",
"accept": "application/json",
"indeed-locale": "en-US",
"accept-language": "en-US,en;q=0.9",
"user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_6_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 Indeed App 193.1",
"indeed-app-info": "appv=193.1; appid=com.indeed.jobsearch; osv=16.6.1; os=ios; dtype=phone",
}
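
A sketch of rendering job_search_query: the doubled braces in the template survive str.format() as literal GraphQL braces, while {what}, {location}, {cursor}, and {filters} are substituted. The values below are hypothetical; extra keyword arguments passed by the scraper (such as dateOnIndeed) are simply ignored by str.format.

query = job_search_query.format(
    what='what: "python developer"',
    location='location: {where: "Austin, TX", radius: 50, radiusUnit: MILES}',
    cursor="",
    filters="",
)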


@@ -7,19 +7,21 @@ This module contains routines to scrape LinkedIn.
from __future__ import annotations
+ import math
import time
import random
+ import regex as re
from typing import Optional
from datetime import datetime
- from threading import Lock
from bs4.element import Tag
from bs4 import BeautifulSoup
- from urllib.parse import urlparse, urlunparse
+ from urllib.parse import urlparse, urlunparse, unquote
+ from .constants import headers
from .. import Scraper, ScraperInput, Site
from ..exceptions import LinkedInException
- from ..utils import create_session
+ from ..utils import create_session, remove_attributes, create_logger
from ...jobs import (
JobPost,
Location,
@@ -30,13 +32,14 @@ from ...jobs import (
DescriptionFormat,
)
from ..utils import (
- logger,
extract_emails_from_text,
get_enum_from_job_type,
currency_parser,
markdown_converter,
)
+ log = create_logger("LinkedIn")
class LinkedInScraper(Scraper):
base_url = "https://www.linkedin.com"
@@ -44,13 +47,25 @@ class LinkedInScraper(Scraper):
band_delay = 4
jobs_per_page = 25
- def __init__(self, proxy: Optional[str] = None):
+ def __init__(
+ self, proxies: list[str] | str | None = None, ca_cert: str | None = None
+ ):
"""
Initializes LinkedInScraper with the LinkedIn job search url
"""
- super().__init__(Site(Site.LINKEDIN), proxy=proxy)
+ super().__init__(Site.LINKEDIN, proxies=proxies, ca_cert=ca_cert)
+ self.session = create_session(
+ proxies=self.proxies,
+ ca_cert=ca_cert,
+ is_tls=False,
+ has_retry=True,
+ delay=5,
+ clear_cookies=True,
+ )
+ self.session.headers.update(headers)
self.scraper_input = None
self.country = "worldwide"
+ self.job_url_direct_regex = re.compile(r'(?<=\?url=)[^"]+')
def scrape(self, scraper_input: ScraperInput) -> JobResponse:
"""
@@ -60,18 +75,20 @@ class LinkedInScraper(Scraper):
"""
self.scraper_input = scraper_input
job_list: list[JobPost] = []
- seen_urls = set()
- url_lock = Lock()
- page = scraper_input.offset // 25 + 25 if scraper_input.offset else 0
+ seen_ids = set()
+ start = scraper_input.offset // 10 * 10 if scraper_input.offset else 0
+ request_count = 0
seconds_old = (
scraper_input.hours_old * 3600 if scraper_input.hours_old else None
)
continue_search = (
- lambda: len(job_list) < scraper_input.results_wanted and page < 1000
+ lambda: len(job_list) < scraper_input.results_wanted and start < 1000
)
while continue_search():
- logger.info(f"LinkedIn search page: {page // 25 + 1}")
- session = create_session(is_tls=False, has_retry=True, delay=5)
+ request_count += 1
+ log.info(
+ f"search page: {request_count} / {math.ceil(scraper_input.results_wanted / 10)}"
+ )
params = {
"keywords": scraper_input.search_term,
"location": scraper_input.location,
@@ -83,7 +100,7 @@ class LinkedInScraper(Scraper):
else None
),
"pageNum": 0,
- "start": page + scraper_input.offset,
+ "start": start,
"f_AL": "true" if scraper_input.easy_apply else None,
"f_C": (
",".join(map(str, scraper_input.linkedin_company_ids))
@@ -96,12 +113,9 @@ class LinkedInScraper(Scraper):
params = {k: v for k, v in params.items() if v is not None}
try:
- response = session.get(
+ response = self.session.get(
f"{self.base_url}/jobs-guest/jobs/api/seeMoreJobPostings/search?",
params=params,
- allow_redirects=True,
- proxies=self.proxy,
- headers=self.headers,
timeout=10,
)
if response.status_code not in range(200, 400):
@@ -112,13 +126,13 @@ class LinkedInScraper(Scraper):
else:
err = f"LinkedIn response status code {response.status_code}"
err += f" - {response.text}"
- logger.error(err)
+ log.error(err)
return JobResponse(jobs=job_list)
except Exception as e:
if "Proxy responded with" in str(e):
- logger.error(f"LinkedIn: Bad proxy")
+ log.error(f"LinkedIn: Bad proxy")
else:
- logger.error(f"LinkedIn: {str(e)}")
+ log.error(f"LinkedIn: {str(e)}")
return JobResponse(jobs=job_list)
soup = BeautifulSoup(response.text, "html.parser")
@@ -127,36 +141,34 @@ class LinkedInScraper(Scraper):
return JobResponse(jobs=job_list)
for job_card in job_cards:
- job_url = None
href_tag = job_card.find("a", class_="base-card__full-link")
if href_tag and "href" in href_tag.attrs:
href = href_tag.attrs["href"].split("?")[0]
job_id = href.split("-")[-1]
- job_url = f"{self.base_url}/jobs/view/{job_id}"
- with url_lock:
- if job_url in seen_urls:
+ if job_id in seen_ids:
continue
- seen_urls.add(job_url)
+ seen_ids.add(job_id)
try:
fetch_desc = scraper_input.linkedin_fetch_description
- job_post = self._process_job(job_card, job_url, fetch_desc)
+ job_post = self._process_job(job_card, job_id, fetch_desc)
if job_post:
job_list.append(job_post)
if not continue_search():
break
except Exception as e:
raise LinkedInException(str(e))
if continue_search():
time.sleep(random.uniform(self.delay, self.delay + self.band_delay))
- page += self.jobs_per_page
+ start += len(job_list)
job_list = job_list[: scraper_input.results_wanted]
return JobResponse(jobs=job_list)
def _process_job(
- self, job_card: Tag, job_url: str, full_descr: bool
+ self, job_card: Tag, job_id: str, full_descr: bool
) -> Optional[JobPost]:
salary_tag = job_card.find("span", class_="job-search-card__salary-info")
@@ -194,48 +206,51 @@ class LinkedInScraper(Scraper):
if metadata_card
else None
)
- date_posted = description = job_type = None
+ date_posted = None
if datetime_tag and "datetime" in datetime_tag.attrs:
datetime_str = datetime_tag["datetime"]
try:
date_posted = datetime.strptime(datetime_str, "%Y-%m-%d")
except:
date_posted = None
- benefits_tag = job_card.find("span", class_="result-benefits__text")
+ job_details = {}
if full_descr:
- description, job_type = self._get_job_description(job_url)
+ job_details = self._get_job_details(job_id)
return JobPost(
+ id=f"li-{job_id}",
title=title,
company_name=company,
company_url=company_url,
location=location,
date_posted=date_posted,
- job_url=job_url,
+ job_url=f"{self.base_url}/jobs/view/{job_id}",
compensation=compensation,
- job_type=job_type,
- description=description,
- emails=extract_emails_from_text(description) if description else None,
+ job_type=job_details.get("job_type"),
+ job_level=job_details.get("job_level", "").lower(),
+ company_industry=job_details.get("company_industry"),
+ description=job_details.get("description"),
+ job_url_direct=job_details.get("job_url_direct"),
+ emails=extract_emails_from_text(job_details.get("description")),
+ company_logo=job_details.get("company_logo"),
+ job_function=job_details.get("job_function"),
)
- def _get_job_description(
- self, job_page_url: str
- ) -> tuple[None, None] | tuple[str | None, tuple[str | None, JobType | None]]:
+ def _get_job_details(self, job_id: str) -> dict:
"""
- Retrieves job description by going to the job page url
+ Retrieves job description and other job details by going to the job page url
:param job_page_url:
- :return: description or None
+ :return: dict
"""
try:
- session = create_session(is_tls=False, has_retry=True)
- response = session.get(
- job_page_url, headers=self.headers, timeout=5, proxies=self.proxy
- )
+ response = self.session.get(
+ f"{self.base_url}/jobs/view/{job_id}", timeout=5
+ )
response.raise_for_status()
except:
- return None, None
+ return {}
- if response.url == "https://www.linkedin.com/signup":
- return None, None
+ if "linkedin.com/signup" in response.url:
+ return {}
soup = BeautifulSoup(response.text, "html.parser")
div_content = soup.find(
@@ -243,17 +258,37 @@ class LinkedInScraper(Scraper):
)
description = None
if div_content is not None:
- def remove_attributes(tag):
- for attr in list(tag.attrs):
- del tag[attr]
- return tag
div_content = remove_attributes(div_content)
description = div_content.prettify(formatter="html")
if self.scraper_input.description_format == DescriptionFormat.MARKDOWN:
description = markdown_converter(description)
- return description, self._parse_job_type(soup)
+ h3_tag = soup.find(
+ "h3", text=lambda text: text and "Job function" in text.strip()
+ )
+ job_function = None
+ if h3_tag:
+ job_function_span = h3_tag.find_next(
+ "span", class_="description__job-criteria-text"
+ )
+ if job_function_span:
+ job_function = job_function_span.text.strip()
+ company_logo = (
+ logo_image.get("data-delayed-url")
+ if (logo_image := soup.find("img", {"class": "artdeco-entity-image"}))
+ else None
+ )
+ return {
+ "description": description,
+ "job_level": self._parse_job_level(soup),
+ "company_industry": self._parse_company_industry(soup),
+ "job_type": self._parse_job_type(soup),
+ "job_url_direct": self._parse_job_url_direct(soup),
+ "company_logo": company_logo,
+ "job_function": job_function,
+ }
def _get_location(self, metadata_card: Optional[Tag]) -> Location:
"""
@@ -306,6 +341,69 @@ class LinkedInScraper(Scraper):
return [get_enum_from_job_type(employment_type)] if employment_type else []
@staticmethod
def _parse_job_level(soup_job_level: BeautifulSoup) -> str | None:
"""
Gets the job level from job page
:param soup_job_level:
:return: str
"""
h3_tag = soup_job_level.find(
"h3",
class_="description__job-criteria-subheader",
string=lambda text: "Seniority level" in text,
)
job_level = None
if h3_tag:
job_level_span = h3_tag.find_next_sibling(
"span",
class_="description__job-criteria-text description__job-criteria-text--criteria",
)
if job_level_span:
job_level = job_level_span.get_text(strip=True)
return job_level
@staticmethod
def _parse_company_industry(soup_industry: BeautifulSoup) -> str | None:
"""
Gets the company industry from job page
:param soup_industry:
:return: str
"""
h3_tag = soup_industry.find(
"h3",
class_="description__job-criteria-subheader",
string=lambda text: "Industries" in text,
)
industry = None
if h3_tag:
industry_span = h3_tag.find_next_sibling(
"span",
class_="description__job-criteria-text description__job-criteria-text--criteria",
)
if industry_span:
industry = industry_span.get_text(strip=True)
return industry
def _parse_job_url_direct(self, soup: BeautifulSoup) -> str | None:
"""
Gets the job url direct from job page
:param soup:
:return: str
"""
job_url_direct = None
job_url_direct_content = soup.find("code", id="applyUrl")
if job_url_direct_content:
job_url_direct_match = self.job_url_direct_regex.search(
job_url_direct_content.decode_contents().strip()
)
if job_url_direct_match:
job_url_direct = unquote(job_url_direct_match.group())
return job_url_direct
@staticmethod
def job_type_code(job_type_enum: JobType) -> str:
return {
@@ -315,12 +413,3 @@ class LinkedInScraper(Scraper):
JobType.CONTRACT: "C",
JobType.TEMPORARY: "T",
}.get(job_type_enum, "")
- headers = {
- "authority": "www.linkedin.com",
- "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
- "accept-language": "en-US,en;q=0.9",
- "cache-control": "max-age=0",
- "upgrade-insecure-requests": "1",
- "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
- }
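
The direct-apply extraction added above can be sketched standalone: LinkedIn embeds the external URL percent-encoded behind a ?url= parameter inside <code id="applyUrl">, and the lookbehind regex plus unquote() recover it. The markup below is a hypothetical simplification, and the stdlib re module stands in for the regex package used in the diff:

import re
from urllib.parse import unquote
from bs4 import BeautifulSoup

html = '<code id="applyUrl">"https://www.linkedin.com/redirect?url=https%3A%2F%2Fexample.com%2Fjob%2F123"</code>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("code", id="applyUrl")
match = re.search(r'(?<=\?url=)[^"]+', tag.decode_contents().strip())
if match:
    print(unquote(match.group()))  # https://example.com/job/123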


@@ -0,0 +1,8 @@
headers = {
"authority": "www.linkedin.com",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"accept-language": "en-US,en;q=0.9",
"cache-control": "max-age=0",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
}


@@ -1,24 +1,154 @@
from __future__ import annotations
- import re
import logging
+ import re
+ from itertools import cycle
+ import numpy as np
import requests
import tls_client
- import numpy as np
+ import urllib3
from markdownify import markdownify as md
from requests.adapters import HTTPAdapter, Retry
- from ..jobs import JobType
+ from ..jobs import CompensationInterval, JobType
- logger = logging.getLogger("JobSpy")
- logger.propagate = False
- if not logger.handlers:
- logger.setLevel(logging.INFO)
- console_handler = logging.StreamHandler()
- format = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
- formatter = logging.Formatter(format)
- console_handler.setFormatter(formatter)
- logger.addHandler(console_handler)
+ urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
+ def create_logger(name: str):
+ logger = logging.getLogger(f"JobSpy:{name}")
+ logger.propagate = False
+ if not logger.handlers:
+ logger.setLevel(logging.INFO)
+ console_handler = logging.StreamHandler()
+ format = "%(asctime)s - %(levelname)s - %(name)s - %(message)s"
+ formatter = logging.Formatter(format)
+ console_handler.setFormatter(formatter)
+ logger.addHandler(console_handler)
+ return logger
class RotatingProxySession:
def __init__(self, proxies=None):
if isinstance(proxies, str):
self.proxy_cycle = cycle([self.format_proxy(proxies)])
elif isinstance(proxies, list):
self.proxy_cycle = (
cycle([self.format_proxy(proxy) for proxy in proxies])
if proxies
else None
)
else:
self.proxy_cycle = None
@staticmethod
def format_proxy(proxy):
"""Utility method to format a proxy string into a dictionary."""
if proxy.startswith("http://") or proxy.startswith("https://"):
return {"http": proxy, "https": proxy}
return {"http": f"http://{proxy}", "https": f"http://{proxy}"}
class RequestsRotating(RotatingProxySession, requests.Session):
def __init__(self, proxies=None, has_retry=False, delay=1, clear_cookies=False):
RotatingProxySession.__init__(self, proxies=proxies)
requests.Session.__init__(self)
self.clear_cookies = clear_cookies
self.allow_redirects = True
self.setup_session(has_retry, delay)
def setup_session(self, has_retry, delay):
if has_retry:
retries = Retry(
total=3,
connect=3,
status=3,
status_forcelist=[500, 502, 503, 504, 429],
backoff_factor=delay,
)
adapter = HTTPAdapter(max_retries=retries)
self.mount("http://", adapter)
self.mount("https://", adapter)
def request(self, method, url, **kwargs):
if self.clear_cookies:
self.cookies.clear()
if self.proxy_cycle:
next_proxy = next(self.proxy_cycle)
if next_proxy["http"] != "http://localhost":
self.proxies = next_proxy
else:
self.proxies = {}
return requests.Session.request(self, method, url, **kwargs)
class TLSRotating(RotatingProxySession, tls_client.Session):
def __init__(self, proxies=None):
RotatingProxySession.__init__(self, proxies=proxies)
tls_client.Session.__init__(self, random_tls_extension_order=True)
def execute_request(self, *args, **kwargs):
if self.proxy_cycle:
next_proxy = next(self.proxy_cycle)
if next_proxy["http"] != "http://localhost":
self.proxies = next_proxy
else:
self.proxies = {}
response = tls_client.Session.execute_request(self, *args, **kwargs)
response.ok = response.status_code in range(200, 400)
return response
def create_session(
*,
proxies: dict | str | None = None,
ca_cert: str | None = None,
is_tls: bool = True,
has_retry: bool = False,
delay: int = 1,
clear_cookies: bool = False,
) -> requests.Session:
"""
Creates a requests session with optional tls, proxy, and retry settings.
:return: A session object
"""
if is_tls:
session = TLSRotating(proxies=proxies)
else:
session = RequestsRotating(
proxies=proxies,
has_retry=has_retry,
delay=delay,
clear_cookies=clear_cookies,
)
if ca_cert:
session.verify = ca_cert
return session
def set_logger_level(verbose: int):
"""
Adjusts the logger's level. This function allows the logging level to be changed at runtime.
Parameters:
- verbose: int {0, 1, 2} (default=2, all logs)
"""
if verbose is None:
return
level_name = {2: "INFO", 1: "WARNING", 0: "ERROR"}.get(verbose, "INFO")
level = getattr(logging, level_name.upper(), None)
if level is not None:
for logger_name in logging.root.manager.loggerDict:
if logger_name.startswith("JobSpy:"):
logging.getLogger(logger_name).setLevel(level)
else:
raise ValueError(f"Invalid log level: {level_name}")
def markdown_converter(description_html: str):
@@ -35,39 +165,6 @@ def extract_emails_from_text(text: str) -> list[str] | None:
return email_regex.findall(text)
def create_session(
proxy: dict | None = None,
is_tls: bool = True,
has_retry: bool = False,
delay: int = 1,
) -> requests.Session:
"""
Creates a requests session with optional tls, proxy, and retry settings.
:return: A session object
"""
if is_tls:
session = tls_client.Session(random_tls_extension_order=True)
session.proxies = proxy
else:
session = requests.Session()
session.allow_redirects = True
if proxy:
session.proxies.update(proxy)
if has_retry:
retries = Retry(
total=3,
connect=3,
status=3,
status_forcelist=[500, 502, 503, 504, 429],
backoff_factor=delay,
)
adapter = HTTPAdapter(max_retries=retries)
session.mount("http://", adapter)
session.mount("https://", adapter)
return session
def get_enum_from_job_type(job_type_str: str) -> JobType | None:
"""
Given a string, returns the corresponding JobType enum member if a match is found.
@@ -94,3 +191,98 @@ def currency_parser(cur_str):
num = float(cur_str)
return np.round(num, 2)
def remove_attributes(tag):
for attr in list(tag.attrs):
del tag[attr]
return tag
def extract_salary(
salary_str,
lower_limit=1000,
upper_limit=700000,
hourly_threshold=350,
monthly_threshold=30000,
enforce_annual_salary=False,
):
"""
Extracts salary information from a string and returns the salary interval, min and max salary values, and currency.
(TODO: Needs test cases as the regex is complicated and may not cover all edge cases)
"""
if not salary_str:
return None, None, None, None
annual_max_salary = None
min_max_pattern = r"\$(\d+(?:,\d+)?(?:\.\d+)?)([kK]?)\s*[-—–]\s*(?:\$)?(\d+(?:,\d+)?(?:\.\d+)?)([kK]?)"
def to_int(s):
return int(float(s.replace(",", "")))
def convert_hourly_to_annual(hourly_wage):
return hourly_wage * 2080
def convert_monthly_to_annual(monthly_wage):
return monthly_wage * 12
match = re.search(min_max_pattern, salary_str)
if match:
min_salary = to_int(match.group(1))
max_salary = to_int(match.group(3))
# Handle 'k' suffix for min and max salaries independently
if "k" in match.group(2).lower() or "k" in match.group(4).lower():
min_salary *= 1000
max_salary *= 1000
# Convert to annual if less than the hourly threshold
if min_salary < hourly_threshold:
interval = CompensationInterval.HOURLY.value
annual_min_salary = convert_hourly_to_annual(min_salary)
if max_salary < hourly_threshold:
annual_max_salary = convert_hourly_to_annual(max_salary)
elif min_salary < monthly_threshold:
interval = CompensationInterval.MONTHLY.value
annual_min_salary = convert_monthly_to_annual(min_salary)
if max_salary < monthly_threshold:
annual_max_salary = convert_monthly_to_annual(max_salary)
else:
interval = CompensationInterval.YEARLY.value
annual_min_salary = min_salary
annual_max_salary = max_salary
# Ensure salary range is within specified limits
if not annual_max_salary:
return None, None, None, None
if (
lower_limit <= annual_min_salary <= upper_limit
and lower_limit <= annual_max_salary <= upper_limit
and annual_min_salary < annual_max_salary
):
if enforce_annual_salary:
return interval, annual_min_salary, annual_max_salary, "USD"
else:
return interval, min_salary, max_salary, "USD"
return None, None, None, None
def extract_job_type(description: str):
if not description:
return []
keywords = {
JobType.FULL_TIME: r"full\s?time",
JobType.PART_TIME: r"part\s?time",
JobType.INTERNSHIP: r"internship",
JobType.CONTRACT: r"contract",
}
listing_types = []
for key, pattern in keywords.items():
if re.search(pattern, description, re.IGNORECASE):
listing_types.append(key)
return listing_types if listing_types else None
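
Worked examples of extract_salary's normalization rules, assuming this module is importable as jobspy.scrapers.utils and that the CompensationInterval values are the lowercase strings "hourly"/"monthly"/"yearly":

from jobspy.scrapers.utils import extract_salary  # assumed module path

# "$30 - $40": both under the 350 hourly threshold -> hourly; annualized as
# 30*2080=62,400 and 40*2080=83,200, inside the 1,000-700,000 sanity band,
# so the original figures are returned.
print(extract_salary("$30 - $40"))      # ('hourly', 30, 40, 'USD')

# "$5k - $7k": 'k' scales to 5,000-7,000, under the 30,000 monthly threshold
# -> monthly, annualized to 60,000-84,000 for the sanity check.
print(extract_salary("$5k - $7k"))      # ('monthly', 5000, 7000, 'USD')

# "$90k - $120k": at or above the monthly threshold -> taken as yearly as-is.
print(extract_salary("$90k - $120k"))   # ('yearly', 90000, 120000, 'USD')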

View File

@@ -7,19 +7,24 @@ This module contains routines to scrape ZipRecruiter.
from __future__ import annotations
+ import json
import math
+ import re
import time
+ from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from typing import Optional, Tuple, Any
- from concurrent.futures import ThreadPoolExecutor
+ from bs4 import BeautifulSoup
+ from .constants import headers
from .. import Scraper, ScraperInput, Site
from ..utils import (
- logger,
extract_emails_from_text,
create_session,
markdown_converter,
+ remove_attributes,
+ create_logger,
)
from ...jobs import (
JobPost,
@@ -31,19 +36,25 @@ from ...jobs import (
DescriptionFormat,
)
+ log = create_logger("ZipRecruiter")
class ZipRecruiterScraper(Scraper):
base_url = "https://www.ziprecruiter.com"
api_url = "https://api.ziprecruiter.com"
- def __init__(self, proxy: Optional[str] = None):
+ def __init__(
+ self, proxies: list[str] | str | None = None, ca_cert: str | None = None
+ ):
"""
Initializes ZipRecruiterScraper with the ZipRecruiter job search url
"""
+ super().__init__(Site.ZIP_RECRUITER, proxies=proxies)
self.scraper_input = None
- self.session = create_session(proxy)
+ self.session = create_session(proxies=proxies, ca_cert=ca_cert)
+ self.session.headers.update(headers)
self._get_cookies()
- super().__init__(Site.ZIP_RECRUITER, proxy=proxy)
self.delay = 5
self.jobs_per_page = 20
@@ -65,7 +76,7 @@ class ZipRecruiterScraper(Scraper):
break
if page > 1:
time.sleep(self.delay)
- logger.info(f"ZipRecruiter search page: {page}")
+ log.info(f"search page: {page} / {max_pages}")
jobs_on_page, continue_token = self._find_jobs_in_page(
scraper_input, continue_token
)
@@ -91,22 +102,20 @@ class ZipRecruiterScraper(Scraper):
if continue_token:
params["continue_from"] = continue_token
try:
- res = self.session.get(
- f"{self.api_url}/jobs-app/jobs", headers=self.headers, params=params
- )
+ res = self.session.get(f"{self.api_url}/jobs-app/jobs", params=params)
if res.status_code not in range(200, 400):
if res.status_code == 429:
err = "429 Response - Blocked by ZipRecruiter for too many requests"
else:
err = f"ZipRecruiter response status code {res.status_code}"
err += f" with response: {res.text}"  # ZipRecruiter likely not available in EU
- logger.error(err)
+ log.error(err)
return jobs_list, ""
except Exception as e:
if "Proxy responded with" in str(e):
- logger.error(f"Indeed: Bad proxy")
+ log.error(f"ZipRecruiter: Bad proxy")
else:
- logger.error(f"Indeed: {str(e)}")
+ log.error(f"ZipRecruiter: {str(e)}")
return jobs_list, ""
res_data = res.json()
@@ -129,6 +138,7 @@ class ZipRecruiterScraper(Scraper):
self.seen_urls.add(job_url)
description = job.get("job_description", "").strip()
+ listing_type = job.get("buyer_type", "")
description = (
markdown_converter(description)
if self.scraper_input.description_format == DescriptionFormat.MARKDOWN
@@ -150,7 +160,10 @@ class ZipRecruiterScraper(Scraper):
comp_min = int(job["compensation_min"]) if "compensation_min" in job else None
comp_max = int(job["compensation_max"]) if "compensation_max" in job else None
comp_currency = job.get("compensation_currency")
+ description_full, job_url_direct = self._get_descr(job_url)
return JobPost(
+ id=f'zr-{job["listing_key"]}',
title=title,
company_name=company,
location=location,
@@ -163,14 +176,68 @@ class ZipRecruiterScraper(Scraper):
),
date_posted=date_posted,
job_url=job_url,
- description=description,
+ description=description_full if description_full else description,
emails=extract_emails_from_text(description) if description else None,
+ job_url_direct=job_url_direct,
+ listing_type=listing_type,
)
def _get_descr(self, job_url):
res = self.session.get(job_url, allow_redirects=True)
description_full = job_url_direct = None
if res.ok:
soup = BeautifulSoup(res.text, "html.parser")
job_descr_div = soup.find("div", class_="job_description")
company_descr_section = soup.find("section", class_="company_description")
job_description_clean = (
remove_attributes(job_descr_div).prettify(formatter="html")
if job_descr_div
else ""
)
company_description_clean = (
remove_attributes(company_descr_section).prettify(formatter="html")
if company_descr_section
else ""
)
description_full = job_description_clean + company_description_clean
script_tag = soup.find("script", type="application/json")
if script_tag:
job_json = json.loads(script_tag.string)
job_url_val = job_json["model"].get("saveJobURL", "")
m = re.search(r"job_url=(.+)", job_url_val)
if m:
job_url_direct = m.group(1)
if self.scraper_input.description_format == DescriptionFormat.MARKDOWN:
description_full = markdown_converter(description_full)
return description_full, job_url_direct
def _get_cookies(self): def _get_cookies(self):
data = "event_type=session&logged_in=false&number_of_retry=1&property=model%3AiPhone&property=os%3AiOS&property=locale%3Aen_us&property=app_build_number%3A4734&property=app_version%3A91.0&property=manufacturer%3AApple&property=timestamp%3A2024-01-12T12%3A04%3A42-06%3A00&property=screen_height%3A852&property=os_version%3A16.6.1&property=source%3Ainstall&property=screen_width%3A393&property=device_model%3AiPhone%2014%20Pro&property=brand%3AApple" """
Sends a session event to the API with device properties.
"""
data = [
("event_type", "session"),
("logged_in", "false"),
("number_of_retry", "1"),
("property", "model:iPhone"),
("property", "os:iOS"),
("property", "locale:en_us"),
("property", "app_build_number:4734"),
("property", "app_version:91.0"),
("property", "manufacturer:Apple"),
("property", "timestamp:2025-01-12T12:04:42-06:00"),
("property", "screen_height:852"),
("property", "os_version:16.6.1"),
("property", "source:install"),
("property", "screen_width:393"),
("property", "device_model:iPhone 14 Pro"),
("property", "brand:Apple"),
]
url = f"{self.api_url}/jobs-app/event" url = f"{self.api_url}/jobs-app/event"
self.session.post(url, data=data, headers=self.headers) self.session.post(url, data=data)
@staticmethod @staticmethod
def _get_job_type_enum(job_type_str: str) -> list[JobType] | None: def _get_job_type_enum(job_type_str: str) -> list[JobType] | None:
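
Two of the additions above deserve a note. First, _get_descr pulls the direct apply link out of the page's embedded JSON. A standalone sketch of that extraction using a made-up payload (the model/saveJobURL shape mirrors the code above; the URL is invented for illustration):

    import json
    import re

    payload = '{"model": {"saveJobURL": "/jobs-app/save?job_url=https://example.com/apply/123"}}'
    job_url_val = json.loads(payload)["model"].get("saveJobURL", "")
    m = re.search(r"job_url=(.+)", job_url_val)
    print(m.group(1) if m else None)  # https://example.com/apply/123

Second, the rewritten _get_cookies body leans on standard form encoding of repeated keys: a list of tuples emits property= once per entry, reproducing the old hand-built string without its unreadable percent-escapes. A quick self-contained check:

    from urllib.parse import urlencode

    # Repeated keys in a list of tuples are each emitted once.
    print(urlencode([("event_type", "session"), ("property", "model:iPhone")]))
    # event_type=session&property=model%3AiPhone
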
@@ -198,14 +265,3 @@ class ZipRecruiterScraper(Scraper):
         if scraper_input.distance:
             params["radius"] = scraper_input.distance
         return {k: v for k, v in params.items() if v is not None}
-
-headers = {
-    "Host": "api.ziprecruiter.com",
-    "accept": "*/*",
-    "x-zr-zva-override": "100000000;vid:ZT1huzm_EQlDTVEc",
-    "x-pushnotificationid": "0ff4983d38d7fc5b3370297f2bcffcf4b3321c418f5c22dd152a0264707602a0",
-    "x-deviceid": "D77B3A92-E589-46A4-8A39-6EF6F1D86006",
-    "user-agent": "Job Search/87.0 (iPhone; CPU iOS 16_6_1 like Mac OS X)",
-    "authorization": "Basic YTBlZjMyZDYtN2I0Yy00MWVkLWEyODMtYTI1NDAzMzI0YTcyOg==",
-    "accept-language": "en-US,en;q=0.9",
-}


@@ -0,0 +1,10 @@
+headers = {
+    "Host": "api.ziprecruiter.com",
+    "accept": "*/*",
+    "x-zr-zva-override": "100000000;vid:ZT1huzm_EQlDTVEc",
+    "x-pushnotificationid": "0ff4983d38d7fc5b3370297f2bcffcf4b3321c418f5c22dd152a0264707602a0",
+    "x-deviceid": "D77B3A92-E589-46A4-8A39-6EF6F1D86006",
+    "user-agent": "Job Search/87.0 (iPhone; CPU iOS 16_6_1 like Mac OS X)",
+    "authorization": "Basic YTBlZjMyZDYtN2I0Yy00MWVkLWEyODMtYTI1NDAzMzI0YTcyOg==",
+    "accept-language": "en-US,en;q=0.9",
+}
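
The dict removed from the scraper module reappears here verbatim as a new file, so the headers are defined once and shared. The compare view does not show the file's path; a shared constants module is assumed in this sketch of how the scraper side would consume it:

    import requests

    # Trimmed copy of the dict above; the real module (name not shown in this
    # view) holds all ten entries.
    headers = {"Host": "api.ziprecruiter.com", "accept": "*/*"}

    session = requests.Session()
    session.headers.update(headers)  # now sent with every session.get/post
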


@@ -1,14 +0,0 @@
-from ..jobspy import scrape_jobs
-import pandas as pd
-
-
-def test_all():
-    result = scrape_jobs(
-        site_name=["linkedin", "indeed", "zip_recruiter", "glassdoor"],
-        search_term="software engineer",
-        results_wanted=5,
-    )
-
-    assert (
-        isinstance(result, pd.DataFrame) and not result.empty
-    ), "Result should be a non-empty DataFrame"


@@ -1,11 +0,0 @@
-from ..jobspy import scrape_jobs
-import pandas as pd
-
-def test_indeed():
-    result = scrape_jobs(
-        site_name="glassdoor", search_term="software engineer", country_indeed="USA"
-    )
-
-    assert (
-        isinstance(result, pd.DataFrame) and not result.empty
-    ), "Result should be a non-empty DataFrame"


@@ -1,11 +0,0 @@
-from ..jobspy import scrape_jobs
-import pandas as pd
-
-def test_indeed():
-    result = scrape_jobs(
-        site_name="indeed", search_term="software engineer", country_indeed="usa"
-    )
-
-    assert (
-        isinstance(result, pd.DataFrame) and not result.empty
-    ), "Result should be a non-empty DataFrame"


@@ -1,12 +0,0 @@
-from ..jobspy import scrape_jobs
-import pandas as pd
-
-def test_linkedin():
-    result = scrape_jobs(
-        site_name="linkedin",
-        search_term="software engineer",
-    )
-
-    assert (
-        isinstance(result, pd.DataFrame) and not result.empty
-    ), "Result should be a non-empty DataFrame"


@@ -1,13 +0,0 @@
-from ..jobspy import scrape_jobs
-import pandas as pd
-
-
-def test_ziprecruiter():
-    result = scrape_jobs(
-        site_name="zip_recruiter",
-        search_term="software engineer",
-    )
-
-    assert (
-        isinstance(result, pd.DataFrame) and not result.empty
-    ), "Result should be a non-empty DataFrame"