[enh]: extract emails

pull/51/head
Cullen Watson 2023-09-28 18:09:21 -05:00
parent c802c8c3b8
commit e4b925605d
13 changed files with 990 additions and 969 deletions
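The core of this change is a new `extract_emails_from_text` helper, imported from `...utils` in each scraper diff below and applied to the scraped job description, with the result stored on the new `emails` field of `JobPost` and the new `emails` column of the output DataFrame. The helper's own source is not part of the hunks shown here; a minimal regex-based sketch of what it plausibly looks like (the function name matches the import, the body is an assumption):

```python
import re
from typing import Optional

# simple email pattern; sketch only, the real jobspy/utils.py is not shown in this diff
EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails_from_text(text: Optional[str]) -> Optional[list[str]]:
    """Return every email-looking substring in `text`, or None if there are none."""
    if not text:
        return None
    emails = EMAIL_REGEX.findall(text)
    return emails or None
```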


@@ -7,27 +7,27 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.10"
- name: Install poetry
run: >-
python3 -m
pip install
poetry
--user
- name: Build distribution 📦
run: >-
python3 -m
poetry
build
- name: Publish distribution 📦 to PyPI
if: startsWith(github.ref, 'refs/tags')
uses: pypa/gh-action-pypi-publish@release/v1
with:
password: ${{ secrets.PYPI_API_TOKEN }}


@@ -4,26 +4,30 @@
**Not technical?** Try out the web scraping tool on our site at [usejobspy.com](https://usejobspy.com).
*Looking to build a data-focused software product?* **[Book a call](https://calendly.com/zachary-products/15min)** *to work with us.*
\
Check out another project we wrote: ***[HomeHarvest](https://github.com/ZacharyHampton/HomeHarvest)** a Python package for real estate scraping*
## Features
- Scrapes job postings from **LinkedIn**, **Indeed** & **ZipRecruiter** simultaneously
- Aggregates the job postings in a Pandas DataFrame
- Proxy support (HTTP/S, SOCKS)
[Video Guide for JobSpy](https://www.youtube.com/watch?v=RuP1HrAZnxs&pp=ygUgam9icyBzY3JhcGVyIGJvdCBsaW5rZWRpbiBpbmRlZWQ%3D) - Updated for release v1.1.3
![jobspy](https://github.com/cullenwatson/JobSpy/assets/78247585/ec7ef355-05f6-4fd3-8161-a817e31c5c57)
### Installation
```
pip install --upgrade python-jobspy
```
_Python version >= [3.10](https://www.python.org/downloads/release/python-3100/) required_
### Usage
@@ -65,6 +69,7 @@ print(jobs)
```
### Output
```
SITE TITLE COMPANY_NAME CITY STATE JOB_TYPE INTERVAL MIN_AMOUNT MAX_AMOUNT JOB_URL DESCRIPTION
indeed Software Engineer AMERICAN SYSTEMS Arlington VA None yearly 200000 150000 https://www.indeed.com/viewjob?jk=5e409e577046... THIS POSITION COMES WITH A 10K SIGNING BONUS!...
@@ -74,7 +79,9 @@ linkedin Full-Stack Software Engineer Rain New York
zip_recruiter Software Engineer - New Grad ZipRecruiter Santa Monica CA fulltime yearly 130000 150000 https://www.ziprecruiter.com/jobs/ziprecruiter... We offer a hybrid work environment. Most US-ba...
zip_recruiter Software Developer TEKsystems Phoenix AZ fulltime hourly 65 75 https://www.ziprecruiter.com/jobs/teksystems-0... Top Skills' Details• 6 years of Java developme...
```
### Parameters for `scrape_jobs()`
```plaintext
Required
├── site_name (List[enum]): linkedin, zip_recruiter, indeed
@@ -91,8 +98,8 @@ Optional
├── offset (int): starts the search from an offset (e.g. 25 will start the search from the 25th result)
```
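Because `offset` simply shifts where the search starts, larger result sets can be collected in pages; a small sketch under the parameter semantics above (site and search values are placeholders, and the top-level `from jobspy import scrape_jobs` import is assumed):

```python
from jobspy import scrape_jobs  # top-level import assumed

# first 25 results, then the next 25 starting from the 25th result
page_1 = scrape_jobs(site_name=["indeed"], search_term="software engineer",
                     country_indeed="USA", results_wanted=25)
page_2 = scrape_jobs(site_name=["indeed"], search_term="software engineer",
                     country_indeed="USA", results_wanted=25, offset=25)
```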
### JobPost Schema
```plaintext
JobPost
├── title (str)
@@ -113,14 +120,15 @@ JobPost
```
### Exceptions
The following exceptions may be raised when using JobSpy:
* `LinkedInException`
* `IndeedException`
* `ZipRecruiterException`
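Wrapping a call in a try/except on these lets a script keep going when one board blocks the request; a sketch, assuming the exception classes live in `jobspy.scrapers.exceptions` (consistent with the `from ..exceptions import LinkedInException` seen in the LinkedIn scraper diff below):

```python
from jobspy import scrape_jobs
from jobspy.scrapers.exceptions import LinkedInException  # module path assumed

try:
    jobs = scrape_jobs(site_name=["linkedin"], search_term="software engineer")
except LinkedInException as e:
    # LinkedIn blocks aggressively (see the 429 FAQ below); retry later or behind a proxy
    print(f"LinkedIn scrape failed: {e}")
```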
## Supported Countries for Job Searching
### **LinkedIn**
LinkedIn searches globally & uses only the `location` parameter.
@@ -129,43 +137,45 @@ LinkedIn searches globally & uses only the `location` parameter.
ZipRecruiter searches for jobs in **US/Canada** & uses only the `location` parameter.
### **Indeed**
Indeed supports most countries, but the `country_indeed` parameter is required. Additionally, use the `location` parameter to narrow down the location, e.g. city & state if necessary.
You can specify the following countries when searching on Indeed (use the exact name):
| | | | |
|----------------------|--------------|------------|----------------|
| Argentina | Australia | Austria | Bahrain |
| Belgium | Brazil | Canada | Chile |
| China | Colombia | Costa Rica | Czech Republic |
| Denmark | Ecuador | Egypt | Finland |
| France | Germany | Greece | Hong Kong |
| Hungary | India | Indonesia | Ireland |
| Israel | Italy | Japan | Kuwait |
| Luxembourg | Malaysia | Mexico | Morocco |
| Netherlands | New Zealand | Nigeria | Norway |
| Oman | Pakistan | Panama | Peru |
| Philippines | Poland | Portugal | Qatar |
| Romania | Saudi Arabia | Singapore | South Africa |
| South Korea | Spain | Sweden | Switzerland |
| Taiwan | Thailand | Turkey | Ukraine |
| United Arab Emirates | UK | USA | Uruguay |
| Venezuela | Vietnam | | |
## Frequently Asked Questions
---
**Q: Encountering issues with your queries?**
**A:** Try reducing the number of `results_wanted` and/or broadening the filters. If problems persist, [submit an issue](https://github.com/cullenwatson/JobSpy/issues).
---
**Q: Received a response code 429?**
**A:** This indicates that you have been blocked by the job board site for sending too many requests. Currently, **LinkedIn** is particularly aggressive with blocking. We recommend:
- Waiting a few seconds between requests.
- Trying a VPN or proxy to change your IP address.
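The proxy suggestion maps directly onto the `proxy` argument of `scrape_jobs`; a brief sketch (the proxy URL is a placeholder, not a working endpoint):

```python
from jobspy import scrape_jobs

jobs = scrape_jobs(
    site_name=["linkedin"],
    search_term="software engineer",
    results_wanted=10,  # keeping this low also reduces the chance of a 429
    # any HTTP/S or SOCKS proxy URL; a rotating proxy works best
    proxy="http://username:password@proxy.example.com:20001",
)
```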
@@ -174,6 +184,7 @@ You can specify the following countries when searching on Indeed (use the exact
**Q: Experiencing a "Segmentation fault: 11" on macOS Catalina?**
**A:** This is due to the `tls_client` dependency not supporting your architecture. Solutions and workarounds include:
- Upgrade to a newer version of macOS
- Reach out to the maintainers of [tls_client](https://github.com/bogdanfinn/tls-client) for fixes


@@ -5,9 +5,9 @@ jobs: pd.DataFrame = scrape_jobs(
site_name=["indeed", "linkedin", "zip_recruiter"],
search_term="software engineer",
location="Dallas, TX",
results_wanted=50, # be wary: the higher it is, the more likely you'll get blocked (rotating proxy should work though)
country_indeed='USA',
offset=25 # start jobs from an offset (use if search failed and want to continue)
# proxy="http://jobspy:5a4vpWtj8EeJ2hoYzk@ca.smartproxy.com:20001",
)
@@ -29,5 +29,3 @@ print('outputted to jobs.csv')
# 4: display in Jupyter Notebook (1. pip install jupyter 2. jupyter notebook)
# display(jobs)
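One caveat when exporting: the `emails` column introduced by this commit holds Python lists (or None), which serialize awkwardly to CSV. A hedged sketch of flattening it before the `to_csv` step, assuming `jobs` is the DataFrame produced above:

```python
# join each row's list of addresses into a single string; leave blanks where none were found
jobs["emails"] = jobs["emails"].apply(
    lambda v: ", ".join(v) if isinstance(v, list) else ""
)
jobs.to_csv("jobs.csv", index=False)
```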

poetry.lock (generated): 1774 changed lines

File diff suppressed because it is too large.


@@ -26,18 +26,18 @@ def _map_str_to_site(site_name: str) -> Site:
def scrape_jobs(
site_name: str | List[str] | Site | List[Site],
search_term: str,
location: str = "",
distance: int = None,
is_remote: bool = False,
job_type: str = None,
easy_apply: bool = False, # linkedin
results_wanted: int = 15,
country_indeed: str = "usa",
hyperlinks: bool = False,
proxy: Optional[str] = None,
offset: Optional[int] = 0
) -> pd.DataFrame:
"""
Simultaneously scrapes job data from multiple job sites.
@@ -49,8 +49,8 @@ def scrape_jobs(
if value_str in job_type.value:
return job_type
raise Exception(f"Invalid job type: {value_str}")
job_type = get_enum_from_value(job_type) if job_type else None
if type(site_name) == str:
site_type = [_map_str_to_site(site_name)]
@@ -162,6 +162,7 @@ def scrape_jobs(
"min_amount",
"max_amount",
"currency",
"emails",
"description",
]
jobs_formatted_df = jobs_df[desired_order]
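With `emails` added to `desired_order`, every row of the returned DataFrame carries an `emails` column holding either a list of addresses or None. A short sketch of pulling out only postings that expose a contact address, under that assumption:

```python
# `jobs` stands in for the DataFrame returned by scrape_jobs
with_contacts = jobs[jobs["emails"].notna()]

# flatten the per-row lists into one deduplicated, sorted list of addresses
all_emails = sorted({email for row in with_contacts["emails"] for email in row})
print(all_emails)
```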


@@ -187,6 +187,7 @@ class JobPost(BaseModel):
compensation: Optional[Compensation] = None
date_posted: Optional[date] = None
benefits: Optional[str] = None
emails: Optional[list[str]] = None
class JobResponse(BaseModel):


@@ -27,6 +27,7 @@ from ...jobs import (
JobType,
)
from .. import Scraper, ScraperInput, Site
from ...utils import extract_emails_from_text
class IndeedScraper(Scraper):
@@ -138,6 +139,7 @@ class IndeedScraper(Scraper):
date_posted = date_posted.strftime("%Y-%m-%d")
description = self.get_description(job_url, session)
emails = extract_emails_from_text(description)
with io.StringIO(job["snippet"]) as f:
soup_io = BeautifulSoup(f, "html.parser")
li_elements = soup_io.find_all("li")
@@ -153,6 +155,7 @@ class IndeedScraper(Scraper):
state=job.get("jobLocationState"),
country=self.country,
),
emails=extract_emails_from_text(description),
job_type=job_type,
compensation=compensation,
date_posted=date_posted,


@@ -17,13 +17,13 @@ from threading import Lock
from .. import Scraper, ScraperInput, Site
from ..exceptions import LinkedInException
from ... import JobType
from ...jobs import (
JobPost,
Location,
JobResponse,
JobType,
)
from ...utils import extract_emails_from_text
class LinkedInScraper(Scraper):
@@ -162,7 +162,7 @@ class LinkedInScraper(Scraper):
benefits_tag = job_card.find("span", class_="result-benefits__text")
benefits = " ".join(benefits_tag.get_text().split()) if benefits_tag else None
description, job_type = self.get_job_info_page(job_url)
description, job_type = self.get_job_description(job_url)
return JobPost(
title=title,
@@ -173,9 +173,10 @@ class LinkedInScraper(Scraper):
job_url=job_url,
job_type=job_type,
benefits=benefits,
emails=extract_emails_from_text(description)
)
def get_job_info_page(self, job_page_url: str) -> tuple[None, None] | tuple[
def get_job_description(self, job_page_url: str) -> tuple[None, None] | tuple[
str | None, tuple[str | None, JobType | None]]:
"""
Retrieves job description by going to the job page url
@@ -193,9 +194,9 @@ class LinkedInScraper(Scraper):
"div", class_=lambda x: x and "show-more-less-html__markup" in x
)
text_content = None
description = None
if div_content:
text_content = " ".join(div_content.get_text().split()).strip()
description = " ".join(div_content.get_text().split()).strip()
def get_job_type(
soup_job_type: BeautifulSoup,
@@ -224,7 +225,7 @@ class LinkedInScraper(Scraper):
return LinkedInScraper.get_enum_from_value(employment_type)
return text_content, get_job_type(soup)
return description, get_job_type(soup)
@staticmethod
def get_enum_from_value(value_str):


@@ -28,6 +28,7 @@ from ...jobs import (
JobType,
Country,
)
from ...utils import extract_emails_from_text
class ZipRecruiterScraper(Scraper):
@@ -174,6 +175,7 @@ class ZipRecruiterScraper(Scraper):
compensation=ZipRecruiterScraper.get_compensation(job),
date_posted=date_posted,
job_url=job_url,
emails=extract_emails_from_text(description),
)
return job_post
@@ -465,4 +467,3 @@ class ZipRecruiterScraper(Scraper):
parsed_url = urlparse(url)
return urlunparse((parsed_url.scheme, parsed_url.netloc, parsed_url.path, parsed_url.params, '', ''))
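The trailing hunk shows the URL-cleanup helper applied to ZipRecruiter job links: `urlunparse` keeps the scheme, host, path, and params while blanking the query string and fragment. A standalone illustration (the sample URL is made up):

```python
from urllib.parse import urlparse, urlunparse


def clean_url(url: str) -> str:
    # same composition as the scraper's helper above
    parsed = urlparse(url)
    return urlunparse((parsed.scheme, parsed.netloc, parsed.path, parsed.params, "", ""))


print(clean_url("https://www.ziprecruiter.com/jobs/teksystems-0abc?tsid=123#apply"))
# -> https://www.ziprecruiter.com/jobs/teksystems-0abc
```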


@@ -1,4 +1,5 @@
from ..jobspy import scrape_jobs
import pandas as pd
def test_all():
@@ -7,4 +8,5 @@ def test_all():
search_term="software engineer",
results_wanted=5,
)
assert result is not None and result.errors.empty is True
assert isinstance(result, pd.DataFrame) and not result.empty, "Result should be a non-empty DataFrame"


@@ -1,4 +1,5 @@
from ..jobspy import scrape_jobs
import pandas as pd
def test_indeed():
@@ -6,4 +7,4 @@ def test_indeed():
site_name="indeed",
search_term="software engineer",
)
assert result is not None and result.errors.empty is True
assert isinstance(result, pd.DataFrame) and not result.empty, "Result should be a non-empty DataFrame"


@@ -1,4 +1,5 @@
from ..jobspy import scrape_jobs
import pandas as pd
def test_linkedin():
@@ -6,4 +7,4 @@ def test_linkedin():
site_name="linkedin",
search_term="software engineer",
)
assert result is not None and result.errors.empty is True
assert isinstance(result, pd.DataFrame) and not result.empty, "Result should be a non-empty DataFrame"


@@ -1,4 +1,5 @@
from ..jobspy import scrape_jobs
import pandas as pd
def test_ziprecruiter():
@@ -7,4 +8,4 @@ def test_ziprecruiter():
search_term="software engineer",
)
assert result is not None and result.errors.empty is True
assert isinstance(result, pd.DataFrame) and not result.empty, "Result should be a non-empty DataFrame"
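A quick interactive check in the same spirit as the updated assertions, extended to the `emails` column this commit adds (the top-level import path is assumed; the tests themselves use a relative `..jobspy` import):

```python
import pandas as pd
from jobspy import scrape_jobs  # import path assumed

result = scrape_jobs(site_name="indeed", search_term="software engineer")
assert isinstance(result, pd.DataFrame) and not result.empty, "Result should be a non-empty DataFrame"
assert "emails" in result.columns  # column added by this commit
```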