<img src="https://github.com/cullenwatson/JobSpy/assets/78247585/ae185b7e-e444-4712-8bb9-fa97f53e896b" width="400">
**JobSpy** is a simple, yet comprehensive, job scraping library.
*Looking to build a data-focused software product?* **[Book a call](https://bunsly.com/)** *to work with us.*
## Features
- Scrapes job postings from **LinkedIn**, **Indeed**, **Glassdoor**, **Google**, & **ZipRecruiter** simultaneously
- Aggregates the job postings in a dataframe
- Proxy support to bypass blocking
![jobspy](https://github.com/cullenwatson/JobSpy/assets/78247585/ec7ef355-05f6-4fd3-8161-a817e31c5c57)
### Installation
```
pip install -U python-jobspy
```
_Python version >= [3.10](https://www.python.org/downloads/release/python-3100/) required_
### Usage
```python
import csv
from jobspy import scrape_jobs

jobs = scrape_jobs(
    site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor", "google"],
    search_term="software engineer",
    google_search_term="software engineer jobs near San Francisco, CA since yesterday",
    location="San Francisco, CA",
    results_wanted=20,
    hours_old=72,
    country_indeed='USA',

    # linkedin_fetch_description=True  # gets more info such as description, direct job url (slower)
    # proxies=["208.195.175.46:65095", "208.195.175.45:65095", "localhost"],
)
print(f"Found {len(jobs)} jobs")
print(jobs.head())
# Write the results to CSV; use jobs.to_excel(...) instead for an Excel file.
jobs.to_csv("jobs.csv", quoting=csv.QUOTE_NONNUMERIC, escapechar="\\", index=False)
```
### Output
```
SITE           TITLE                             COMPANY           CITY          STATE  JOB_TYPE  INTERVAL  MIN_AMOUNT  MAX_AMOUNT  JOB_URL                                            DESCRIPTION
indeed         Software Engineer                 AMERICAN SYSTEMS  Arlington     VA     None      yearly    200000      150000      https://www.indeed.com/viewjob?jk=5e409e577046...  THIS POSITION COMES WITH A 10K SIGNING BONUS!...
indeed         Senior Software Engineer          TherapyNotes.com  Philadelphia  PA     fulltime  yearly    135000      110000      https://www.indeed.com/viewjob?jk=da39574a40cb...  About Us TherapyNotes is the national leader i...
linkedin       Software Engineer - Early Career  Lockheed Martin   Sunnyvale     CA     fulltime  yearly    None        None        https://www.linkedin.com/jobs/view/3693012711      Description:By bringing together people that u...
linkedin       Full-Stack Software Engineer      Rain              New York      NY     fulltime  yearly    None        None        https://www.linkedin.com/jobs/view/3696158877      Rains mission is to create the fastest and ea...
zip_recruiter  Software Engineer - New Grad      ZipRecruiter      Santa Monica  CA     fulltime  yearly    130000      150000      https://www.ziprecruiter.com/jobs/ziprecruiter...  We offer a hybrid work environment. Most US-ba...
zip_recruiter  Software Developer                TEKsystems        Phoenix       AZ     fulltime  hourly    65          75          https://www.ziprecruiter.com/jobs/teksystems-0...  Top Skills' Details• 6 years of Java developme...
```
### Parameters for `scrape_jobs()`
```plaintext
Optional
├── site_name (list|str):
|    linkedin, zip_recruiter, indeed, glassdoor, google
|    (default is all)
|
├── search_term (str)
|
├── google_search_term (str)
|    search term for google jobs. This is the only param for filtering google jobs.
|
├── location (str)
|
├── distance (int):
|    in miles, default 50
|
├── job_type (str):
|    fulltime, parttime, internship, contract
|
├── proxies (list):
|    in format ['user:pass@host:port', 'localhost']
|    each job board scraper will round robin through the proxies
|
├── is_remote (bool)
|
├── results_wanted (int):
|    number of job results to retrieve for each site specified in 'site_name'
|
├── easy_apply (bool):
|    filters for jobs that are hosted on the job board site (LinkedIn easy apply filter no longer works)
|
├── description_format (str):
|    markdown, html (Format type of the job descriptions. Default is markdown.)
|
├── offset (int):
|    starts the search from an offset (e.g. 25 will start the search from the 25th result)
|
├── hours_old (int):
|    filters jobs by the number of hours since the job was posted
|    (ZipRecruiter and Glassdoor round up to next day.)
|
├── verbose (int) {0, 1, 2}:
|    Controls the verbosity of the runtime printouts
|    (0 prints only errors, 1 is errors+warnings, 2 is all logs. Default is 2.)
|
├── linkedin_fetch_description (bool):
|    fetches full description and direct job url for LinkedIn (Increases requests by O(n))
|
├── linkedin_company_ids (list[int]):
|    searches for linkedin jobs with specific company ids
|
├── country_indeed (str):
|    filters the country on Indeed & Glassdoor (see below for correct spelling)
|
├── enforce_annual_salary (bool):
|    converts wages to annual salary
|
├── ca_cert (str)
|    path to CA Certificate file for proxies
```
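The round-robin behavior described for `proxies` can be pictured with a minimal sketch. This is not JobSpy's internal code: `ProxyRotator` is a hypothetical name, and the sketch only assumes that a scraper walks the list in order and wraps around at the end.

```python
from itertools import cycle


class ProxyRotator:
    """Hypothetical helper: hands out proxies in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self):
        # Each call returns the next entry and wraps around at the end of the list.
        return next(self._pool)


rotator = ProxyRotator(["user:pass@host1:8080", "user:pass@host2:8080", "localhost"])
picked = [rotator.next_proxy() for _ in range(4)]
print(picked)  # the fourth pick is the first proxy again
```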
```
├── Indeed limitations:
|    Only one from this list can be used in a search:
|    - hours_old
|    - job_type & is_remote
|    - easy_apply
|
└── LinkedIn limitations:
     Only one from this list can be used in a search:
     - hours_old
     - easy_apply
```
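Since only one filter group can be used per search on Indeed and LinkedIn, it may help to check the arguments before calling `scrape_jobs()`. The sketch below is a hypothetical pre-flight check, not part of JobSpy; the group definitions are copied from the limitations above, and a parameter is treated as "in use" when it is set to a truthy value (an assumption).

```python
# Filter groups copied from the limitations above; at most one group may be used per search.
INDEED_GROUPS = [{"hours_old"}, {"job_type", "is_remote"}, {"easy_apply"}]
LINKEDIN_GROUPS = [{"hours_old"}, {"easy_apply"}]


def conflicting_groups(params, groups):
    """Return the active filter groups if more than one is in use, else an empty list."""
    active = []
    for group in groups:
        # A group counts as active if any of its parameters is set to a truthy value.
        if any(params.get(name) for name in group):
            active.append(sorted(group))
    return active if len(active) > 1 else []


params = {"search_term": "software engineer", "hours_old": 72, "easy_apply": True}
print(conflicting_groups(params, INDEED_GROUPS))
```

An empty list means the combination respects the limitation for that site; `job_type` together with `is_remote` counts as a single group on Indeed.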
## Supported Countries for Job Searching
### **LinkedIn**
LinkedIn searches globally & uses only the `location` parameter.
### **ZipRecruiter**
ZipRecruiter searches for jobs in **US/Canada** & uses only the `location` parameter.
### **Indeed / Glassdoor**
Indeed & Glassdoor support most countries, but the `country_indeed` parameter is required. Additionally, use the `location` parameter to narrow down the location, e.g. city & state, if necessary.
You can specify the following countries when searching on Indeed (use the exact name, * indicates support for Glassdoor):
|                      |              |            |                |
|----------------------|--------------|------------|----------------|
| Argentina            | Australia*   | Austria*   | Bahrain        |
| Belgium*             | Brazil*      | Canada*    | Chile          |
| China                | Colombia     | Costa Rica | Czech Republic |
| Denmark              | Ecuador      | Egypt      | Finland        |
| France*              | Germany*     | Greece     | Hong Kong*     |
| Hungary              | India*       | Indonesia  | Ireland*       |
| Israel               | Italy*       | Japan      | Kuwait         |
| Luxembourg           | Malaysia     | Mexico*    | Morocco        |
| Netherlands*         | New Zealand* | Nigeria    | Norway         |
| Oman                 | Pakistan     | Panama     | Peru           |
| Philippines          | Poland       | Portugal   | Qatar          |
| Romania              | Saudi Arabia | Singapore* | South Africa   |
| South Korea          | Spain*       | Sweden     | Switzerland*   |
| Taiwan               | Thailand     | Turkey     | Ukraine        |
| United Arab Emirates | UK*          | USA*       | Uruguay        |
| Venezuela            | Vietnam*     |            |                |
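
As a rough illustration of how the table can be used, the sketch below picks the sites to search for a given `country_indeed` value. `COUNTRY_SUPPORT` and `sites_for_country` are hypothetical names, and only a handful of rows from the table are included.

```python
# A few entries copied from the table above; True marks Glassdoor support (the * entries).
COUNTRY_SUPPORT = {
    "USA": True,
    "UK": True,
    "Canada": True,
    "Germany": True,
    "Japan": False,
    "Sweden": False,
}


def sites_for_country(country):
    """Return which of Indeed/Glassdoor can be searched for `country` (hypothetical helper)."""
    if country not in COUNTRY_SUPPORT:
        raise ValueError(f"unknown country name: {country!r}")
    sites = ["indeed"]
    if COUNTRY_SUPPORT[country]:
        sites.append("glassdoor")
    return sites


print(sites_for_country("Germany"))
print(sites_for_country("Japan"))
```

The returned list could then be passed as `site_name`, with the same country name given as `country_indeed`.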
## Notes
* Indeed is currently the best scraper, with no rate limiting.
* All the job board endpoints are capped at around 1000 jobs on a given search.
* LinkedIn is the most restrictive and usually rate-limits around the 10th page from a single IP. Proxies are basically a must.
## Frequently Asked Questions
---
**Q: Why is Indeed giving unrelated roles?**
**A:** Indeed searches the description too.
- Use `-` to exclude words.
- Use `""` for an exact match.

Example of a good Indeed query:

```py
search_term='"engineering intern" software summer (java OR python OR c++) 2025 -tax -marketing'
```

This searches the description and title. Results must include `software`, `summer`, `2025`, at least one of the listed languages, and the exact phrase `engineering intern`; results containing `tax` or `marketing` are excluded.
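
A query like this can also be assembled from its parts. The sketch below uses a hypothetical helper, `build_indeed_query`, which is not part of JobSpy; it only joins strings using the operators described above. The terms come out in a slightly different order than in the example, which is assumed not to matter to the search.

```python
def build_indeed_query(exact=(), required=(), any_of=(), excluded=()):
    """Compose an Indeed search string (hypothetical helper, not part of JobSpy)."""
    parts = [f'"{phrase}"' for phrase in exact]  # exact-match phrases in double quotes
    parts += list(required)  # words that must appear
    if any_of:
        parts.append("(" + " OR ".join(any_of) + ")")  # at least one of these
    parts += [f"-{word}" for word in excluded]  # words to exclude
    return " ".join(parts)


query = build_indeed_query(
    exact=["engineering intern"],
    required=["software", "summer", "2025"],
    any_of=["java", "python", "c++"],
    excluded=["tax", "marketing"],
)
print(query)
```

The resulting string can be passed as `search_term`.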
---
**Q: Received a response code 429?**
**A:** This indicates that you have been blocked by the job board site for sending too many requests. All of the job board sites are aggressive with blocking. We recommend:
- Wait some time between scrapes (site-dependent).
- Try using the `proxies` parameter to change your IP address.
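
One way to "wait some time between scrapes" is to retry with growing delays. The sketch below is a generic pattern, not something JobSpy provides: `scrape_with_backoff` is a hypothetical helper, the delay values are arbitrary, and it assumes a blocked request surfaces as an exception (which exception JobSpy raises, if any, is not stated here, so the sketch catches `Exception` broadly).

```python
import time


def scrape_with_backoff(scrape, max_attempts=4, base_delay=30.0, sleep=time.sleep):
    """Call `scrape()` and wait longer after each failure (hypothetical helper).

    `scrape` is any zero-argument callable, e.g. a lambda wrapping scrape_jobs(...).
    """
    for attempt in range(max_attempts):
        try:
            return scrape()
        except Exception:  # assumption: a block shows up as some exception
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** attempt)


# Demo with a stand-in scraper that fails twice and a fake sleep that records delays.
calls = {"n": 0}
delays = []


def flaky_scrape():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return ["job-1", "job-2"]


result = scrape_with_backoff(flaky_scrape, base_delay=30.0, sleep=delays.append)
print(result, delays)
```

In real use, `scrape` would wrap the actual call, for example `lambda: scrape_jobs(site_name=["linkedin"], search_term="software engineer")`, and the default `time.sleep` would be left in place.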
---
**Q: Encountering issues with your queries?**
**A:** Try reducing the number of `results_wanted` and/or broadening the filters. If problems
persist, [submit an issue](https://github.com/Bunsly/JobSpy/issues).
---
### JobPost Schema
```plaintext
JobPost
├── title
├── company
├── company_url
├── job_url
├── location
│   ├── country
│   ├── city
│   └── state
├── description
├── job_type: fulltime, parttime, internship, contract
├── job_function
│   ├── interval: yearly, monthly, weekly, daily, hourly
│   ├── min_amount
│   ├── max_amount
│   ├── currency
│   └── salary_source: direct_data, description (parsed from posting)
├── date_posted
├── emails
└── is_remote

LinkedIn specific
└── job_level

LinkedIn & Indeed specific
└── company_industry

Indeed specific
├── company_country
├── company_addresses
├── company_employees_label
├── company_revenue_label
├── company_description
└── company_logo
```
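The `interval`, `min_amount`, and `max_amount` fields are what an option like `enforce_annual_salary` would operate on. The multipliers JobSpy uses are not documented here, so the sketch below is only an illustration under assumed values (40-hour weeks, 5-day weeks, 52 weeks per year); `annualize` is a hypothetical helper.

```python
# Assumed multipliers (not taken from JobSpy): 40-hour weeks, 5-day weeks, 52 weeks per year.
PERIODS_PER_YEAR = {
    "yearly": 1,
    "monthly": 12,
    "weekly": 52,
    "daily": 260,
    "hourly": 2080,
}


def annualize(interval, min_amount, max_amount):
    """Scale a (min, max) wage pair to yearly amounts; None values pass through."""
    factor = PERIODS_PER_YEAR[interval]

    def scale(amount):
        return None if amount is None else amount * factor

    return scale(min_amount), scale(max_amount)


print(annualize("hourly", 65, 75))
```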