mirror of https://github.com/Bunsly/JobSpy
Update README.md
parent
f0ea89b357
commit
e30995b1b0
284
README.md
|
@ -1,209 +1,21 @@
|
|||
<img src="https://github.com/cullenwatson/JobSpy/assets/78247585/ae185b7e-e444-4712-8bb9-fa97f53e896b" width="400">
|
||||
# JobSeeker Bot
|
||||
|
||||
**JobSpy** is a simple, yet comprehensive, job scraping library.
|
||||
JobSeeker is a Telegram bot that scrapes job postings from platforms like LinkedIn, Indeed, Glassdoor, and others (currently under development). It gathers job data based on title and location, reformats it into a structured format, and saves it to a MongoDB database. New job posts are automatically sent to a designated Telegram bot chat.
|
||||
|
||||
*Looking to build a data-focused software product?* **[Book a call](https://bunsly.com/)** *to
|
||||
work with us.*
|
||||
This project is based on the [JobSpy](https://github.com/Bunsly/JobSpy) project. Credits to the original creator.
|
||||
|
||||
## Features
|
||||
|
||||
- Scrapes job postings from **LinkedIn**, **Indeed**, **Glassdoor**, **Google**, & **ZipRecruiter** simultaneously
|
||||
- Aggregates the job postings in a dataframe
|
||||
- Proxy support to bypass blocking
|
||||
- **Job scraping**: Collects job postings from multiple job platforms.
|
||||
- **Structured data**: Reformats job data into a structured format for easy processing and storage.
|
||||
- **Database storage**: Saves job data into a MongoDB database.
|
||||
- **Telegram integration**: Sends new job postings directly to a Telegram bot chat.
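One way the "new job postings" step could be decided is by tracking which job URLs have already been handled. The sketch below is an illustration only: the helper name `select_new_jobs` and the in-memory `seen_urls` set are assumptions, not the bot's actual code, and a real deployment would presumably check MongoDB instead of a Python set.

```python
from __future__ import annotations

from typing import Iterable


def select_new_jobs(scraped: Iterable[dict], seen_urls: set[str]) -> list[dict]:
    """Return postings whose job_url was not seen before, and mark them as seen.

    Hypothetical helper: `seen_urls` stands in for a database lookup.
    """
    new_jobs = []
    for job in scraped:
        url = job.get("job_url")
        if not url or url in seen_urls:
            continue  # skip postings without a URL and ones already handled
        seen_urls.add(url)
        new_jobs.append(job)
    return new_jobs


seen = {"https://example.com/a"}
scraped = [
    {"job_url": "https://example.com/a", "title": "Old posting"},
    {"job_url": "https://example.com/b", "title": "New posting"},
]
print(select_new_jobs(scraped, seen))
```

Only the postings returned by such a helper would then be forwarded to the Telegram chat.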
|
||||
|
||||

|
||||
## Data Structure
|
||||
|
||||
### Installation
|
||||
The scraped job postings are stored in the following format:
|
||||
|
||||
```
|
||||
pip install -U python-jobspy
|
||||
```
|
||||
|
||||
_Python version >= [3.10](https://www.python.org/downloads/release/python-3100/) required_
|
||||
|
||||
### Usage
|
||||
|
||||
```python
|
||||
import csv
|
||||
from jobspy import scrape_jobs
|
||||
|
||||
jobs = scrape_jobs(
|
||||
site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor", "google"],
|
||||
search_term="software engineer",
|
||||
google_search_term="software engineer jobs near San Francisco, CA since yesterday",
|
||||
location="San Francisco, CA",
|
||||
results_wanted=20,
|
||||
hours_old=72,
|
||||
country_indeed='USA',
|
||||
|
||||
# linkedin_fetch_description=True # gets more info such as description, direct job url (slower)
|
||||
# proxies=["208.195.175.46:65095", "208.195.175.45:65095", "localhost"],
|
||||
)
|
||||
print(f"Found {len(jobs)} jobs")
|
||||
print(jobs.head())
|
||||
jobs.to_csv("jobs.csv", quoting=csv.QUOTE_NONNUMERIC, escapechar="\\", index=False) # to_excel
|
||||
```
|
||||
|
||||
### Output
|
||||
|
||||
```
|
||||
SITE TITLE COMPANY CITY STATE JOB_TYPE INTERVAL MIN_AMOUNT MAX_AMOUNT JOB_URL DESCRIPTION
|
||||
indeed Software Engineer AMERICAN SYSTEMS Arlington VA None yearly 200000 150000 https://www.indeed.com/viewjob?jk=5e409e577046... THIS POSITION COMES WITH A 10K SIGNING BONUS!...
|
||||
indeed Senior Software Engineer TherapyNotes.com Philadelphia PA fulltime yearly 135000 110000 https://www.indeed.com/viewjob?jk=da39574a40cb... About Us TherapyNotes is the national leader i...
|
||||
linkedin Software Engineer - Early Career Lockheed Martin Sunnyvale CA fulltime yearly None None https://www.linkedin.com/jobs/view/3693012711 Description:By bringing together people that u...
|
||||
linkedin Full-Stack Software Engineer Rain New York NY fulltime yearly None None https://www.linkedin.com/jobs/view/3696158877 Rain’s mission is to create the fastest and ea...
|
||||
zip_recruiter Software Engineer - New Grad ZipRecruiter Santa Monica CA fulltime yearly 130000 150000 https://www.ziprecruiter.com/jobs/ziprecruiter... We offer a hybrid work environment. Most US-ba...
|
||||
zip_recruiter Software Developer TEKsystems Phoenix AZ fulltime hourly 65 75 https://www.ziprecruiter.com/jobs/teksystems-0... Top Skills' Details• 6 years of Java developme...
|
||||
```
|
||||
|
||||
### Parameters for `scrape_jobs()`
|
||||
|
||||
```plaintext
|
||||
Optional
|
||||
├── site_name (list|str):
|
||||
| linkedin, zip_recruiter, indeed, glassdoor, google
|
||||
| (default is all)
|
||||
│
|
||||
├── search_term (str)
|
||||
|
|
||||
├── google_search_term (str)
|
||||
| search term for google jobs. This is the only param for filtering google jobs.
|
||||
│
|
||||
├── location (str)
|
||||
│
|
||||
├── distance (int):
|
||||
| in miles, default 50
|
||||
│
|
||||
├── job_type (str):
|
||||
| fulltime, parttime, internship, contract
|
||||
│
|
||||
├── proxies (list):
|
||||
| in format ['user:pass@host:port', 'localhost']
|
||||
| each job board scraper will round robin through the proxies
|
||||
|
|
||||
├── is_remote (bool)
|
||||
│
|
||||
├── results_wanted (int):
|
||||
| number of job results to retrieve for each site specified in 'site_name'
|
||||
│
|
||||
├── easy_apply (bool):
|
||||
| filters for jobs that are hosted on the job board site (LinkedIn easy apply filter no longer works)
|
||||
│
|
||||
├── description_format (str):
|
||||
| markdown, html (Format type of the job descriptions. Default is markdown.)
|
||||
│
|
||||
├── offset (int):
|
||||
| starts the search from an offset (e.g. 25 will start the search from the 25th result)
|
||||
│
|
||||
├── hours_old (int):
|
||||
| filters jobs by the number of hours since the job was posted
|
||||
| (ZipRecruiter and Glassdoor round up to next day.)
|
||||
│
|
||||
├── verbose (int) {0, 1, 2}:
|
||||
| Controls the verbosity of the runtime printouts
|
||||
| (0 prints only errors, 1 is errors+warnings, 2 is all logs. Default is 2.)
|
||||
|
||||
├── linkedin_fetch_description (bool):
|
||||
| fetches full description and direct job url for LinkedIn (Increases requests by O(n))
|
||||
│
|
||||
├── linkedin_company_ids (list[int]):
|
||||
| searches for linkedin jobs with specific company ids
|
||||
|
|
||||
├── country_indeed (str):
|
||||
| filters the country on Indeed & Glassdoor (see below for correct spelling)
|
||||
|
|
||||
├── enforce_annual_salary (bool):
|
||||
| converts wages to annual salary
|
||||
|
|
||||
├── ca_cert (str)
|
||||
| path to CA Certificate file for proxies
|
||||
```
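The `proxies` entry above says each scraper rotates through the list in round-robin order. A minimal sketch of that idea, assuming nothing about how JobSpy implements it internally (`make_proxy_rotation` is a hypothetical helper and the proxy addresses are placeholders):

```python
from itertools import cycle


def make_proxy_rotation(proxies):
    """Return a zero-argument function that hands out proxies in round-robin order."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxies)  # endlessly repeats the list in its original order
    return lambda: next(pool)


next_proxy = make_proxy_rotation(
    ["user:pass@host1:8080", "user:pass@host2:8080", "localhost"]
)
picked = [next_proxy() for _ in range(4)]
print(picked)  # the fourth pick wraps around to the first proxy
```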
|
||||
|
||||
```
|
||||
├── Indeed limitations:
|
||||
| Only one from this list can be used in a search:
|
||||
| - hours_old
|
||||
| - job_type & is_remote
|
||||
| - easy_apply
|
||||
│
|
||||
└── LinkedIn limitations:
|
||||
| Only one from this list can be used in a search:
|
||||
| - hours_old
|
||||
| - easy_apply
|
||||
```
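Because Indeed accepts only one of these filter groups per search, it can help to check the arguments before calling `scrape_jobs()`. The following is a sketch under that reading of the limitation; `check_indeed_filters` is a hypothetical helper and is not part of JobSpy.

```python
def active_indeed_filter_groups(hours_old=None, job_type=None,
                                is_remote=False, easy_apply=False):
    """List which of the mutually exclusive Indeed filter groups are in use."""
    groups = []
    if hours_old is not None:
        groups.append("hours_old")
    if job_type is not None or is_remote:
        groups.append("job_type & is_remote")  # these two count as one group
    if easy_apply:
        groups.append("easy_apply")
    return groups


def check_indeed_filters(**filters):
    """Raise ValueError if more than one filter group is requested."""
    groups = active_indeed_filter_groups(**filters)
    if len(groups) > 1:
        raise ValueError("Indeed accepts only one of: " + ", ".join(groups))
    return groups


print(check_indeed_filters(hours_old=72))
```

The same pattern would apply to the LinkedIn list, with `hours_old` and `easy_apply` as the two groups.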
|
||||
|
||||
## Supported Countries for Job Searching
|
||||
|
||||
### **LinkedIn**
|
||||
|
||||
LinkedIn searches globally & uses only the `location` parameter.
|
||||
|
||||
### **ZipRecruiter**
|
||||
|
||||
ZipRecruiter searches for jobs in **US/Canada** & uses only the `location` parameter.
|
||||
|
||||
### **Indeed / Glassdoor**
|
||||
|
||||
Indeed & Glassdoor support most countries, but the `country_indeed` parameter is required. Additionally, use the `location`
|
||||
parameter to narrow down the location, e.g. city & state if necessary.
|
||||
|
||||
You can specify the following countries when searching on Indeed (use the exact name, * indicates support for Glassdoor):
|
||||
|
||||
| | | | |
|
||||
|----------------------|--------------|------------|----------------|
|
||||
| Argentina | Australia* | Austria* | Bahrain |
|
||||
| Belgium* | Brazil* | Canada* | Chile |
|
||||
| China | Colombia | Costa Rica | Czech Republic |
|
||||
| Denmark | Ecuador | Egypt | Finland |
|
||||
| France* | Germany* | Greece | Hong Kong* |
|
||||
| Hungary | India* | Indonesia | Ireland* |
|
||||
| Israel | Italy* | Japan | Kuwait |
|
||||
| Luxembourg | Malaysia | Mexico* | Morocco |
|
||||
| Netherlands* | New Zealand* | Nigeria | Norway |
|
||||
| Oman | Pakistan | Panama | Peru |
|
||||
| Philippines | Poland | Portugal | Qatar |
|
||||
| Romania | Saudi Arabia | Singapore* | South Africa |
|
||||
| South Korea | Spain* | Sweden | Switzerland* |
|
||||
| Taiwan | Thailand | Turkey | Ukraine |
|
||||
| United Arab Emirates | UK* | USA* | Uruguay |
|
||||
| Venezuela | Vietnam* | | |
|
||||
|
||||
|
||||
## Notes
|
||||
* Indeed is currently the best scraper, with no rate limiting.
|
||||
* All the job board endpoints are capped at around 1000 jobs on a given search.
|
||||
* LinkedIn is the most restrictive and usually rate-limits around the 10th page from a single IP, so proxies are effectively required.
|
||||
|
||||
## Frequently Asked Questions
|
||||
|
||||
---
|
||||
**Q: Why is Indeed giving unrelated roles?**
|
||||
**A:** Indeed matches the search term against the job description as well as the title.
|
||||
|
||||
- Use `-` to exclude words.
|
||||
- Use `""` for an exact match.
|
||||
|
||||
Example of a good Indeed query:
|
||||
|
||||
```py
|
||||
search_term='"engineering intern" software summer (java OR python OR c++) 2025 -tax -marketing'
|
||||
```
|
||||
|
||||
This query searches both the title and the description. A match must contain `software`, `summer`, and `2025`, at least one of the listed languages, and the exact phrase `engineering intern`; it must not contain `tax` or `marketing`.
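A query like this can also be assembled from its parts, which makes the exact-phrase, OR-group, and exclusion rules explicit. This is a small sketch; `build_indeed_query` is a hypothetical helper, and the order of terms it produces differs slightly from the hand-written example.

```python
def build_indeed_query(exact=(), required=(), any_of=(), excluded=()):
    """Assemble a search_term string from exact phrases, required words,
    one OR-group, and excluded words."""
    parts = [f'"{phrase}"' for phrase in exact]      # "" for an exact match
    parts += list(required)                          # plain words must appear
    if any_of:
        parts.append("(" + " OR ".join(any_of) + ")")  # at least one of these
    parts += [f"-{word}" for word in excluded]       # - removes a word
    return " ".join(parts)


search_term = build_indeed_query(
    exact=["engineering intern"],
    required=["software", "summer", "2025"],
    any_of=["java", "python", "c++"],
    excluded=["tax", "marketing"],
)
print(search_term)
```

The resulting string can be passed as `search_term` to `scrape_jobs()`.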
|
||||
|
||||
---
|
||||
|
||||
**Q: Received a response code 429?**
|
||||
**A:** This indicates that you have been blocked by the job board site for sending too many requests. All of the job board sites are aggressive with blocking. We recommend:
|
||||
|
||||
- Wait some time between scrapes (site-dependent).
|
||||
- Try using the proxies param to change your IP address.
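The "wait some time between scrapes" advice can be sketched as a retry loop with exponential backoff. Everything here is an assumption for illustration: `RateLimited` is a stand-in exception defined in this snippet (how JobSpy reports a 429 is not specified above), and the delays are arbitrary starting points.

```python
import time


class RateLimited(Exception):
    """Hypothetical marker for an HTTP 429 response from a job board."""


def retry_with_backoff(fn, max_attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call fn(); on RateLimited, wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)


# Demo with a fake scrape that is rate limited twice before succeeding.
state = {"calls": 0}


def fake_scrape():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RateLimited()
    return "ok"


waits = []
result = retry_with_backoff(fake_scrape, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

Passing `sleep` as a parameter keeps the demo instant; in real use the default `time.sleep` would apply the delays.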
|
||||
|
||||
---
|
||||
|
||||
### JobPost Schema
|
||||
|
||||
```plaintext
|
||||
```yaml
|
||||
JobPost
|
||||
├── title
|
||||
├── company
|
||||
|
@ -224,18 +36,66 @@ JobPost
|
|||
├── date_posted
|
||||
├── emails
|
||||
└── is_remote
|
||||
|
||||
LinkedIn specific
|
||||
└── job_level
|
||||
|
||||
LinkedIn & Indeed specific
|
||||
└── company_industry
|
||||
|
||||
Indeed specific
|
||||
├── company_country
|
||||
├── company_addresses
|
||||
├── company_employees_label
|
||||
├── company_revenue_label
|
||||
├── company_description
|
||||
└── company_logo
|
||||
```
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Python 3.8+
|
||||
- MongoDB
|
||||
- Telegram bot token (create a bot via [BotFather](https://core.telegram.org/bots#botfather))
|
||||
|
||||
## Installation
|
||||
|
||||
1. **Clone the repository**:
|
||||
```bash
|
||||
git clone https://github.com/yariv245/JobSeeker.git
|
||||
cd JobSeeker
|
||||
```
|
||||
|
||||
2. **Install dependencies**:
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
3. **Set up environment variables**:
|
||||
Create a `.env` file in the root directory with the following:
|
||||
```env
|
||||
TELEGRAM_BOT_TOKEN=your_telegram_bot_token
|
||||
MONGO_URI=your_mongodb_connection_string
|
||||
TELEGRAM_CHAT_ID=your_telegram_chat_id
|
||||
```
|
||||
|
||||
4. **Run the bot**:
|
||||
```bash
|
||||
python bot.py
|
||||
```
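The three variables from step 3 could be validated at startup so that a missing value fails early with a clear message. This is a sketch, not the contents of `bot.py`: `load_settings` is a hypothetical helper, and it assumes the `.env` values have already been loaded into the process environment (for example by the shell or a dotenv loader).

```python
import os

# Variable names taken from the .env example in step 3.
REQUIRED_KEYS = ("TELEGRAM_BOT_TOKEN", "MONGO_URI", "TELEGRAM_CHAT_ID")


def load_settings(env=os.environ):
    """Return the required settings as a dict, or raise if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


# Demo with an explicit mapping instead of the real environment.
settings = load_settings({
    "TELEGRAM_BOT_TOKEN": "your_telegram_bot_token",
    "MONGO_URI": "your_mongodb_connection_string",
    "TELEGRAM_CHAT_ID": "your_telegram_chat_id",
})
print(sorted(settings))
```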
|
||||
|
||||
## Usage
|
||||
|
||||
- Add the bot to a Telegram group or chat.
|
||||
- Start the bot to receive job postings as they are scraped.
|
||||
|
||||
## Testing
|
||||
|
||||
This project includes tests that check data scraping, formatting, and Telegram integration. Run them with:
|
||||
|
||||
```bash
|
||||
pytest
|
||||
```
|
||||
|
||||
Ensure you have the necessary environment variables and mock data set up before running the tests.
|
||||
|
||||
## Contributing
|
||||
|
||||
Contributions are welcome! Please follow these steps:
|
||||
|
||||
1. Fork the repository.
|
||||
2. Create a new branch (`git checkout -b feature/your-feature-name`).
|
||||
3. Commit your changes (`git commit -m 'Add some feature'`).
|
||||
4. Push to the branch (`git push origin feature/your-feature-name`).
|
||||
5. Open a pull request.
|
||||
|
||||
## Acknowledgments
|
||||
|
||||
- [JobSpy](https://github.com/Bunsly/JobSpy) for inspiring this project.
|
||||
|
||||
|
|