From e30995b1b0a1bcc28a0489292b88f58abdde738e Mon Sep 17 00:00:00 2001
From: Yariv Menachem
Date: Sun, 15 Dec 2024 15:23:47 +0200
Subject: [PATCH] Update README.md

---
 README.md | 284 ++++++++++++++----------------------------------------
 1 file changed, 72 insertions(+), 212 deletions(-)

diff --git a/README.md b/README.md
index da42bc7..b87e11e 100644
--- a/README.md
+++ b/README.md
@@ -1,209 +1,21 @@
-
+# JobSeeker Bot
 
-**JobSpy** is a simple, yet comprehensive, job scraping library.
+JobSeeker is a Telegram bot that scrapes job postings from platforms like LinkedIn, Indeed, Glassdoor, and others (currently under development). It gathers job data based on title and location, reformats it into a structured format, and saves it to a MongoDB database. New job posts are automatically sent to a designated Telegram bot chat.
 
-*Looking to build a data-focused software product?* **[Book a call](https://bunsly.com/)** *to
-work with us.*
+This project is based on the [JobSpy](https://github.com/Bunsly/JobSpy) project. Credits to the original creator.
 
 ## Features
 
-- Scrapes job postings from **LinkedIn**, **Indeed**, **Glassdoor**, **Google**, & **ZipRecruiter** simultaneously
-- Aggregates the job postings in a dataframe
-- Proxies support to bypass blocking
+- **Job scraping**: Collects job postings from multiple job platforms.
+- **Structured data**: Reformats job data into a structured format for easy processing and storage.
+- **Database storage**: Saves job data into a MongoDB database.
+- **Telegram integration**: Sends new job postings directly to a Telegram bot chat.
 
-![jobspy](https://github.com/cullenwatson/JobSpy/assets/78247585/ec7ef355-05f6-4fd3-8161-a817e31c5c57)
+## Data Structure
 
-### Installation
+The scraped job postings are stored in the following format:
 
-```
-pip install -U python-jobspy
-```
-
-_Python version >= [3.10](https://www.python.org/downloads/release/python-3100/) required_
-
-### Usage
-
-```python
-import csv
-from jobspy import scrape_jobs
-
-jobs = scrape_jobs(
-    site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor", "google"],
-    search_term="software engineer",
-    google_search_term="software engineer jobs near San Francisco, CA since yesterday",
-    location="San Francisco, CA",
-    results_wanted=20,
-    hours_old=72,
-    country_indeed='USA',
-
-    # linkedin_fetch_description=True # gets more info such as description, direct job url (slower)
-    # proxies=["208.195.175.46:65095", "208.195.175.45:65095", "localhost"],
-)
-print(f"Found {len(jobs)} jobs")
-print(jobs.head())
-jobs.to_csv("jobs.csv", quoting=csv.QUOTE_NONNUMERIC, escapechar="\\", index=False) # to_excel
-```
-
-### Output
-
-```
-SITE TITLE COMPANY CITY STATE JOB_TYPE INTERVAL MIN_AMOUNT MAX_AMOUNT JOB_URL DESCRIPTION
-indeed Software Engineer AMERICAN SYSTEMS Arlington VA None yearly 200000 150000 https://www.indeed.com/viewjob?jk=5e409e577046... THIS POSITION COMES WITH A 10K SIGNING BONUS!...
-indeed Senior Software Engineer TherapyNotes.com Philadelphia PA fulltime yearly 135000 110000 https://www.indeed.com/viewjob?jk=da39574a40cb... About Us TherapyNotes is the national leader i...
-linkedin Software Engineer - Early Career Lockheed Martin Sunnyvale CA fulltime yearly None None https://www.linkedin.com/jobs/view/3693012711 Description:By bringing together people that u...
-linkedin Full-Stack Software Engineer Rain New York NY fulltime yearly None None https://www.linkedin.com/jobs/view/3696158877 Rain’s mission is to create the fastest and ea...
-zip_recruiter Software Engineer - New Grad ZipRecruiter Santa Monica CA fulltime yearly 130000 150000 https://www.ziprecruiter.com/jobs/ziprecruiter... We offer a hybrid work environment. Most US-ba...
-zip_recruiter Software Developer TEKsystems Phoenix AZ fulltime hourly 65 75 https://www.ziprecruiter.com/jobs/teksystems-0... Top Skills' Details• 6 years of Java developme...
-```
-
-### Parameters for `scrape_jobs()`
-
-```plaintext
-Optional
-├── site_name (list|str):
-|    linkedin, zip_recruiter, indeed, glassdoor, google
-|    (default is all)
-│
-├── search_term (str)
-|
-├── google_search_term (str)
-|    search term for google jobs. This is the only param for filtering google jobs.
-│
-├── location (str)
-│
-├── distance (int):
-|    in miles, default 50
-│
-├── job_type (str):
-|    fulltime, parttime, internship, contract
-│
-├── proxies (list):
-|    in format ['user:pass@host:port', 'localhost']
-|    each job board scraper will round robin through the proxies
-|
-├── is_remote (bool)
-│
-├── results_wanted (int):
-|    number of job results to retrieve for each site specified in 'site_name'
-│
-├── easy_apply (bool):
-|    filters for jobs that are hosted on the job board site (LinkedIn easy apply filter no longer works)
-│
-├── description_format (str):
-|    markdown, html (Format type of the job descriptions. Default is markdown.)
-│
-├── offset (int):
-|    starts the search from an offset (e.g. 25 will start the search from the 25th result)
-│
-├── hours_old (int):
-|    filters jobs by the number of hours since the job was posted
-|    (ZipRecruiter and Glassdoor round up to next day.)
-│
-├── verbose (int) {0, 1, 2}:
-|    Controls the verbosity of the runtime printouts
-|    (0 prints only errors, 1 is errors+warnings, 2 is all logs. Default is 2.)
-
-├── linkedin_fetch_description (bool):
-|    fetches full description and direct job url for LinkedIn (Increases requests by O(n))
-│
-├── linkedin_company_ids (list[int]):
-|    searches for linkedin jobs with specific company ids
-|
-├── country_indeed (str):
-|    filters the country on Indeed & Glassdoor (see below for correct spelling)
-|
-├── enforce_annual_salary (bool):
-|    converts wages to annual salary
-|
-├── ca_cert (str)
-|    path to CA Certificate file for proxies
-```
-
-```
-├── Indeed limitations:
-|    Only one from this list can be used in a search:
-|    - hours_old
-|    - job_type & is_remote
-|    - easy_apply
-│
-└── LinkedIn limitations:
-|    Only one from this list can be used in a search:
-|    - hours_old
-|    - easy_apply
-```
-
-## Supported Countries for Job Searching
-
-### **LinkedIn**
-
-LinkedIn searches globally & uses only the `location` parameter.
-
-### **ZipRecruiter**
-
-ZipRecruiter searches for jobs in **US/Canada** & uses only the `location` parameter.
-
-### **Indeed / Glassdoor**
-
-Indeed & Glassdoor supports most countries, but the `country_indeed` parameter is required. Additionally, use the `location`
-parameter to narrow down the location, e.g. city & state if necessary.
-
-You can specify the following countries when searching on Indeed (use the exact name, * indicates support for Glassdoor):
-
-| | | | |
-|----------------------|--------------|------------|----------------|
-| Argentina | Australia* | Austria* | Bahrain |
-| Belgium* | Brazil* | Canada* | Chile |
-| China | Colombia | Costa Rica | Czech Republic |
-| Denmark | Ecuador | Egypt | Finland |
-| France* | Germany* | Greece | Hong Kong* |
-| Hungary | India* | Indonesia | Ireland* |
-| Israel | Italy* | Japan | Kuwait |
-| Luxembourg | Malaysia | Mexico* | Morocco |
-| Netherlands* | New Zealand* | Nigeria | Norway |
-| Oman | Pakistan | Panama | Peru |
-| Philippines | Poland | Portugal | Qatar |
-| Romania | Saudi Arabia | Singapore* | South Africa |
-| South Korea | Spain* | Sweden | Switzerland* |
-| Taiwan | Thailand | Turkey | Ukraine |
-| United Arab Emirates | UK* | USA* | Uruguay |
-| Venezuela | Vietnam* | | |
-
-
-## Notes
-* Indeed is the best scraper currently with no rate limiting.
-* All the job board endpoints are capped at around 1000 jobs on a given search.
-* LinkedIn is the most restrictive and usually rate limits around the 10th page with one ip. Proxies are a must basically.
-
-## Frequently Asked Questions
-
----
-**Q: Why is Indeed giving unrelated roles?**
-**A:** Indeed searches the description too.
-
-- use - to remove words
-- "" for exact match
-
-Example of a good Indeed query
-
-```py
-search_term='"engineering intern" software summer (java OR python OR c++) 2025 -tax -marketing'
-```
-
-This searches the description/title and must include software, summer, 2025, one of the languages, engineering intern exactly, no tax, no marketing.
-
----
-
-**Q: Received a response code 429?**
-**A:** This indicates that you have been blocked by the job board site for sending too many requests. All of the job board sites are aggressive with blocking. We recommend:
-
-- Wait some time between scrapes (site-dependent).
-- Try using the proxies param to change your IP address.
-
----
-
-### JobPost Schema
-
-```plaintext
+```yaml
 JobPost
 ├── title
 ├── company
@@ -224,18 +36,66 @@ JobPost
 ├── date_posted
 ├── emails
 └── is_remote
-
-Linkedin specific
-└── job_level
-
-Linkedin & Indeed specific
-└── company_industry
-
-Indeed specific
-├── company_country
-├── company_addresses
-├── company_employees_label
-├── company_revenue_label
-├── company_description
-└── company_logo
 ```
+
+## Prerequisites
+
+- Python 3.8+
+- MongoDB
+- Telegram bot token (create a bot via [BotFather](https://core.telegram.org/bots#botfather))
+
+## Installation
+
+1. **Clone the repository**:
+   ```bash
+   git clone https://github.com/yariv245/JobSeeker.git
+   cd JobSeeker
+   ```
+
+2. **Install dependencies**:
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+3. **Set up environment variables**:
+   Create a `.env` file in the root directory with the following:
+   ```env
+   TELEGRAM_BOT_TOKEN=your_telegram_bot_token
+   MONGO_URI=your_mongodb_connection_string
+   TELEGRAM_CHAT_ID=your_telegram_chat_id
+   ```
+
+4. **Run the bot**:
+   ```bash
+   python bot.py
+   ```
+
+## Usage
+
+- Add the bot to a Telegram group or chat.
+- Start the bot to receive job postings as they are scraped.
+
+## Testing
+
+This project includes testing to ensure data scraping, formatting, and Telegram integration work as expected. Run the tests using:
+
+```bash
+pytest
+```
+
+Ensure you have the necessary environment variables and mock data set up before running the tests.
+
+## Contributing
+
+Contributions are welcome! Please follow these steps:
+
+1. Fork the repository.
+2. Create a new branch (`git checkout -b feature/your-feature-name`).
+3. Commit your changes (`git commit -m 'Add some feature'`).
+4. Push to the branch (`git push origin feature/your-feature-name`).
+5. Open a pull request.
+
+## Acknowledgments
+
+- [JobSpy](https://github.com/Bunsly/JobSpy) for inspiring this project.
+