[enh]: extract emails

pull/51/head
Cullen Watson 2023-09-28 18:09:21 -05:00
parent c802c8c3b8
commit e4b925605d
13 changed files with 990 additions and 969 deletions

View File

@ -4,21 +4,25 @@
**Not technical?** Try out the web scraping tool on our site at [usejobspy.com](https://usejobspy.com).
*Looking to build a data-focused software product?* **[Book a call](https://calendly.com/zachary-products/15min)** *to work with us.*
\
Check out another project we wrote: ***[HomeHarvest](https://github.com/ZacharyHampton/HomeHarvest)** a Python package for real estate scraping*
## Features
- Scrapes job postings from **LinkedIn**, **Indeed** & **ZipRecruiter** simultaneously
- Aggregates the job postings in a Pandas DataFrame
- Proxy support (HTTP/S, SOCKS)
[Video Guide for JobSpy](https://www.youtube.com/watch?v=RuP1HrAZnxs&pp=ygUgam9icyBzY3JhcGVyIGJvdCBsaW5rZWRpbiBpbmRlZWQ%3D) - Updated for release v1.1.3
![jobspy](https://github.com/cullenwatson/JobSpy/assets/78247585/ec7ef355-05f6-4fd3-8161-a817e31c5c57)
### Installation
```
pip install --upgrade python-jobspy
```
@ -65,6 +69,7 @@ print(jobs)
```
### Output
```
SITE           TITLE                         COMPANY_NAME      CITY          STATE  JOB_TYPE  INTERVAL  MIN_AMOUNT  MAX_AMOUNT  JOB_URL                                            DESCRIPTION
indeed         Software Engineer             AMERICAN SYSTEMS  Arlington     VA     None      yearly    200000      150000      https://www.indeed.com/viewjob?jk=5e409e577046...  THIS POSITION COMES WITH A 10K SIGNING BONUS!...
@ -74,7 +79,9 @@ linkedin Full-Stack Software Engineer Rain New York
zip_recruiter  Software Engineer - New Grad  ZipRecruiter      Santa Monica  CA     fulltime  yearly    130000      150000      https://www.ziprecruiter.com/jobs/ziprecruiter...  We offer a hybrid work environment. Most US-ba...
zip_recruiter  Software Developer            TEKsystems        Phoenix       AZ     fulltime  hourly    65          75          https://www.ziprecruiter.com/jobs/teksystems-0...  Top Skills' Details• 6 years of Java developme...
```
### Parameters for `scrape_jobs()`
```plaintext
Required
├── site_type (List[enum]): linkedin, zip_recruiter, indeed
@ -91,8 +98,8 @@ Optional
├── offset (enum): starts the search from an offset (e.g. 25 will start the search from the 25th result)
```
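A quick illustration of these parameters in use. This is a minimal sketch based on the parameter names shown above and in this commit's tests; the `location` value is only an example, and `country_indeed` can additionally be set for non-default Indeed countries (see the table below).

```python
from jobspy import scrape_jobs

# minimal example; parameter names follow the tests in this commit
jobs = scrape_jobs(
    site_name="indeed",                # a single site, or a list of sites per the parameter list above
    search_term="software engineer",
    location="Dallas, TX",             # optional: narrows the search
    results_wanted=10,
)
print(jobs.head())
```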
### JobPost Schema
```plaintext
JobPost
├── title (str)
@ -113,14 +120,15 @@ JobPost
```
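Since this commit adds an `emails` field to `JobPost` (and an `emails` column to the output DataFrame), one way to inspect it might look like the following sketch; the column names are assumed to mirror the schema field names.

```python
# keep only postings where at least one email address was found in the description
with_emails = jobs[jobs["emails"].notna()]
print(with_emails[["title", "company_name", "emails"]])  # column names assumed to mirror the schema
```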
### Exceptions
The following exceptions may be raised when using JobSpy:
* `LinkedInException`
* `IndeedException`
* `ZipRecruiterException`
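A hedged sketch of handling these exceptions; the import path below is an assumption (inside the package they are defined in a scrapers-level `exceptions` module), so adjust it to wherever your installed version exposes them.

```python
from jobspy import scrape_jobs
# import path assumed; adjust to where the package exposes its exception classes
from jobspy.scrapers.exceptions import LinkedInException, IndeedException, ZipRecruiterException

try:
    jobs = scrape_jobs(site_name="linkedin", search_term="software engineer")
except (LinkedInException, IndeedException, ZipRecruiterException) as e:
    print(f"Scraper error: {e}")
```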
## Supported Countries for Job Searching
### **LinkedIn**
LinkedIn searches globally & uses only the `location` parameter.
@ -129,15 +137,15 @@ LinkedIn searches globally & uses only the `location` parameter.
ZipRecruiter searches for jobs in **US/Canada** & uses only the `location` parameter.
### **Indeed**
Indeed supports most countries, but the `country_indeed` parameter is required. Additionally, use the `location` parameter to narrow down the location, e.g. city & state if necessary.
You can specify the following countries when searching on Indeed (use the exact name):
|           |           |            |                |
|-----------|-----------|------------|----------------|
| Argentina | Australia | Austria    | Bahrain        |
| Belgium   | Brazil    | Canada     | Chile          |
| China     | Colombia  | Costa Rica | Czech Republic |
@ -160,12 +168,14 @@ You can specify the following countries when searching on Indeed (use the exact
---
**Q: Encountering issues with your queries?**
**A:** Try reducing the number of `results_wanted` and/or broadening the filters. If problems persist, [submit an issue](https://github.com/cullenwatson/JobSpy/issues).
---
**Q: Received a response code 429?**
**A:** This indicates that you have been blocked by the job board site for sending too many requests. Currently, **LinkedIn** is particularly aggressive with blocking. We recommend:
- Waiting a few seconds between requests.
- Trying a VPN or proxy to change your IP address (see the sketch below).
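For example, a proxy could be passed through to the scrapers roughly like this. The `proxy` parameter name is an assumption here; check the parameter list above or the package docs for the exact spelling in your version.

```python
# hypothetical proxy usage; the exact parameter name may differ in your version
jobs = scrape_jobs(
    site_name="linkedin",
    search_term="software engineer",
    proxy="http://user:pass@proxy-host:8080",  # HTTP/S or SOCKS URLs, per the feature list
)
```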
@ -174,6 +184,7 @@ You can specify the following countries when searching on Indeed (use the exact
**Q: Experiencing a "Segmentation fault: 11" on macOS Catalina?**
**A:** This is due to the `tls_client` dependency not supporting your architecture. Solutions and workarounds include:
- Upgrade to a newer version of macOS
- Reach out to the maintainers of [tls_client](https://github.com/bogdanfinn/tls-client) for fixes

View File

@ -29,5 +29,3 @@ print('outputted to jobs.csv')
# 4: display in Jupyter Notebook (1. pip install jupyter 2. jupyter notebook)
# display(jobs)

View File

@ -49,8 +49,8 @@ def scrape_jobs(
if value_str in job_type.value:
    return job_type
raise Exception(f"Invalid job type: {value_str}")

job_type = get_enum_from_value(job_type) if job_type else None

if type(site_name) == str:
    site_type = [_map_str_to_site(site_name)]
@ -162,6 +162,7 @@ def scrape_jobs(
"min_amount", "min_amount",
"max_amount", "max_amount",
"currency", "currency",
"emails",
"description", "description",
] ]
jobs_formatted_df = jobs_df[desired_order] jobs_formatted_df = jobs_df[desired_order]

View File

@ -187,6 +187,7 @@ class JobPost(BaseModel):
    compensation: Optional[Compensation] = None
    date_posted: Optional[date] = None
    benefits: Optional[str] = None
+    emails: Optional[list[str]] = None

class JobResponse(BaseModel):

View File

@ -27,6 +27,7 @@ from ...jobs import (
    JobType,
)
from .. import Scraper, ScraperInput, Site
+from ...utils import extract_emails_from_text
class IndeedScraper(Scraper):
@ -138,6 +139,7 @@ class IndeedScraper(Scraper):
date_posted = date_posted.strftime("%Y-%m-%d")
description = self.get_description(job_url, session)
+emails = extract_emails_from_text(description)
with io.StringIO(job["snippet"]) as f:
    soup_io = BeautifulSoup(f, "html.parser")
    li_elements = soup_io.find_all("li")
@ -153,6 +155,7 @@ class IndeedScraper(Scraper):
        state=job.get("jobLocationState"),
        country=self.country,
    ),
+    emails=extract_emails_from_text(description),
    job_type=job_type,
    compensation=compensation,
    date_posted=date_posted,
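The `extract_emails_from_text` helper imported above lives in a new `utils` module that is not shown in this excerpt. A minimal regex-based sketch of what such a helper could look like (the actual implementation in the commit may differ):

```python
import re
from typing import Optional

# simple pattern for "something@domain.tld"; intentionally loose, not RFC-complete
EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails_from_text(text: Optional[str]) -> Optional[list[str]]:
    """Return all email-like strings found in `text`, or None if nothing matched."""
    if not text:
        return None
    emails = EMAIL_REGEX.findall(text)
    return emails if emails else None
```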

View File

@ -17,13 +17,13 @@ from threading import Lock
from .. import Scraper, ScraperInput, Site
from ..exceptions import LinkedInException
-from ... import JobType
from ...jobs import (
    JobPost,
    Location,
    JobResponse,
    JobType,
)
+from ...utils import extract_emails_from_text
class LinkedInScraper(Scraper):
@ -162,7 +162,7 @@ class LinkedInScraper(Scraper):
benefits_tag = job_card.find("span", class_="result-benefits__text")
benefits = " ".join(benefits_tag.get_text().split()) if benefits_tag else None
-description, job_type = self.get_job_info_page(job_url)
+description, job_type = self.get_job_description(job_url)
return JobPost(
    title=title,
@ -173,9 +173,10 @@ class LinkedInScraper(Scraper):
    job_url=job_url,
    job_type=job_type,
    benefits=benefits,
+    emails=extract_emails_from_text(description)
)
-def get_job_info_page(self, job_page_url: str) -> tuple[None, None] | tuple[
+def get_job_description(self, job_page_url: str) -> tuple[None, None] | tuple[
    str | None, tuple[str | None, JobType | None]]:
""" """
Retrieves job description by going to the job page url Retrieves job description by going to the job page url
@ -193,9 +194,9 @@ class LinkedInScraper(Scraper):
"div", class_=lambda x: x and "show-more-less-html__markup" in x "div", class_=lambda x: x and "show-more-less-html__markup" in x
) )
text_content = None description = None
if div_content: if div_content:
text_content = " ".join(div_content.get_text().split()).strip() description = " ".join(div_content.get_text().split()).strip()
def get_job_type(
    soup_job_type: BeautifulSoup,
@ -224,7 +225,7 @@ class LinkedInScraper(Scraper):
    return LinkedInScraper.get_enum_from_value(employment_type)
-return text_content, get_job_type(soup)
+return description, get_job_type(soup)
@staticmethod
def get_enum_from_value(value_str):

View File

@ -28,6 +28,7 @@ from ...jobs import (
    JobType,
    Country,
)
+from ...utils import extract_emails_from_text
class ZipRecruiterScraper(Scraper):
@ -174,6 +175,7 @@ class ZipRecruiterScraper(Scraper):
    compensation=ZipRecruiterScraper.get_compensation(job),
    date_posted=date_posted,
    job_url=job_url,
+    emails=extract_emails_from_text(description),
)
return job_post
@ -465,4 +467,3 @@ class ZipRecruiterScraper(Scraper):
parsed_url = urlparse(url)
return urlunparse((parsed_url.scheme, parsed_url.netloc, parsed_url.path, parsed_url.params, '', ''))

View File

@ -1,4 +1,5 @@
from ..jobspy import scrape_jobs
+import pandas as pd
def test_all():
@ -7,4 +8,5 @@ def test_all():
    search_term="software engineer",
    results_wanted=5,
)
-assert result is not None and result.errors.empty is True
+assert isinstance(result, pd.DataFrame) and not result.empty, "Result should be a non-empty DataFrame"
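A possible follow-up assertion, not part of this commit, could check that the new field is actually exposed; `"emails"` is added to `desired_order` above, so the column should be present in the returned DataFrame.

```python
# hypothetical extra test for the new column added by this commit
def test_emails_column():
    result = scrape_jobs(
        site_name="indeed",
        search_term="software engineer",
        results_wanted=5,
    )
    assert "emails" in result.columns, "DataFrame should contain the new emails column"
```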

View File

@ -1,4 +1,5 @@
from ..jobspy import scrape_jobs
+import pandas as pd
def test_indeed():
@ -6,4 +7,4 @@ def test_indeed():
site_name="indeed", site_name="indeed",
search_term="software engineer", search_term="software engineer",
) )
assert result is not None and result.errors.empty is True assert isinstance(result, pd.DataFrame) and not result.empty, "Result should be a non-empty DataFrame"

View File

@ -1,4 +1,5 @@
from ..jobspy import scrape_jobs
+import pandas as pd
def test_linkedin():
@ -6,4 +7,4 @@ def test_linkedin():
site_name="linkedin", site_name="linkedin",
search_term="software engineer", search_term="software engineer",
) )
assert result is not None and result.errors.empty is True assert isinstance(result, pd.DataFrame) and not result.empty, "Result should be a non-empty DataFrame"

View File

@ -1,4 +1,5 @@
from ..jobspy import scrape_jobs
+import pandas as pd
def test_ziprecruiter():
@ -7,4 +8,4 @@ def test_ziprecruiter():
    search_term="software engineer",
)
-assert result is not None and result.errors.empty is True
+assert isinstance(result, pd.DataFrame) and not result.empty, "Result should be a non-empty DataFrame"