Thread sites (#40)

reduce size of jupyter notebook
Update README.md
2026-03-04 19:44:30 -08:00 · 2023-09-06 09:47:11 -05:00 · 2023-09-05 13:09:18 -05:00 · 2023-09-05 13:03:32 -05:00 · 2023-09-05 12:27:00 -05:00 · 2023-09-05 12:17:22 -05:00
13 changed files with 1520 additions and 781 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -3,6 +3,7 @@
 /venv/
 /ven/
 **/__pycache__/
 **/.pytest_cache/
 *.pyc
 .env
 dist
--- a/JobSpy_Demo.ipynb
+++ b/JobSpy_Demo.ipynb
--- a/README.md
+++ b/README.md
@@ -1,13 +1,18 @@
-# <img src="https://github.com/cullenwatson/JobSpy/assets/78247585/2f61a059-9647-4a9c-bfb9-e3a9448bdc6a" style="vertical-align: sub; margin-right: 5px;"> JobSpy
+<img src="https://github.com/cullenwatson/JobSpy/assets/78247585/ae185b7e-e444-4712-8bb9-fa97f53e896b" width="400">
 **JobSpy** is a simple, yet comprehensive, job scraping library.
 ## Features
 - Scrapes job postings from **LinkedIn**, **Indeed** & **ZipRecruiter** simultaneously
 - Aggregates the job postings in a Pandas DataFrame
-
+  
 ![jobspy](https://github.com/cullenwatson/JobSpy/assets/78247585/ec7ef355-05f6-4fd3-8161-a817e31c5c57)
 ### Installation
-`pip install python-jobspy`  
+```
 pip install python-jobspy
 ```
  _Python version >= [3.10](https://www.python.org/downloads/release/python-3100/) required_ 
@@ -20,24 +25,28 @@ import pandas as pd
 jobs: pd.DataFrame = scrape_jobs(
    site_name=["indeed", "linkedin", "zip_recruiter"],
    search_term="software engineer",
-    results_wanted=10
+    location="Dallas, TX",
    results_wanted=10,
    country='USA' # only needed for indeed
 )
 if jobs.empty:
    print("No jobs found.")
 else:
    # 1 print
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', 50)  # set to 0 to see full job url / desc
    #1 output
    print(jobs)
-    # 2 display in Jupyter Notebook
+    #2 display in Jupyter Notebook
-    # display(jobs)
+    #display(jobs)
-    # 3 output to csv
+    #3 output to .csv
-    # jobs.to_csv('jobs.csv', index=False)
+    #jobs.to_csv('jobs.csv', index=False)
 ```
 ### Output
@@ -51,8 +60,6 @@ zip_recruiter Software Engineer - New Grad       ZipRecruiter      Santa Monica
 zip_recruiter Software Developer                 TEKsystems        Phoenix       AZ     fulltime  hourly    65          75          https://www.ziprecruiter.com/jobs/teksystems-0...  Top Skills' Details• 6 years of Java developme...
 ```
 ### Parameters for `scrape_jobs()`
 ```plaintext
 Required
 ├── site_type (List[enum]): linkedin, zip_recruiter, indeed
@@ -63,36 +70,85 @@ Optional
 ├── job_type (enum): fulltime, parttime, internship, contract
 ├── is_remote (bool)
 ├── results_wanted (int): number of job results to retrieve for each site specified in 'site_type'
-├── easy_apply (bool): filters for jobs on LinkedIn that have the 'Easy Apply' option
+├── easy_apply (bool): filters for jobs that are hosted on LinkedIn
 ├── country (enum): filters the country on Indeed
 ```
 ### JobPost Schema
 ```plaintext
 JobPost
 ├── title (str)
-├── company_name (str)
+├── company (str)
 ├── job_url (str)
 ├── location (object)
 │   ├── country (str)
 │   ├── city (str)
 │   ├── state (str)
 ├── description (str)
-├── job_type (enum)
+├── job_type (enum): fulltime, parttime, internship, contract
 ├── compensation (object)
-│   ├── interval (CompensationInterval): yearly, monthly, weekly, daily, hourly
+│   ├── interval (enum): yearly, monthly, weekly, daily, hourly
-│   ├── min_amount (float)
+│   ├── min_amount (int)
-│   ├── max_amount (float)
+│   ├── max_amount (int)
-│   └── currency (str)
+│   └── currency (enum)
-└── date_posted (datetime)
+└── date_posted (date)
 ```
 ## Supported Countries for Job Searching
 ### **LinkedIn**
 LinkedIn searches globally & uses only the `location` parameter
 ### **ZipRecruiter**
 ZipRecruiter searches for jobs in US/Canada & uses only the `location` parameter
 ### **Indeed**
 For Indeed, the `country` parameter is required. Additionally, use the `location` parameter and include the city or state if necessary.
 You can specify the following countries when searching on Indeed (use the exact name): 
 |      |      |      |      |
 |------|------|------|------|
 | Argentina | Australia | Austria | Bahrain |
 | Belgium | Brazil | Canada | Chile |
 | China | Colombia | Costa Rica | Czech Republic |
 | Denmark | Ecuador | Egypt | Finland |
 | France | Germany | Greece | Hong Kong |
 | Hungary | India | Indonesia | Ireland |
 | Israel | Italy | Japan | Kuwait |
 | Luxembourg | Malaysia | Mexico | Morocco |
 | Netherlands | New Zealand | Nigeria | Norway |
 | Oman | Pakistan | Panama | Peru |
 | Philippines | Poland | Portugal | Qatar |
 | Romania | Saudi Arabia | Singapore | South Africa |
 | South Korea | Spain | Sweden | Switzerland |
 | Taiwan | Thailand | Turkey | Ukraine |
 | United Arab Emirates | UK | USA | Uruguay |
 | Venezuela | Vietnam |  |  |
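As a rough illustration of the "exact name" rule above, the sketch below shows how a country string might be normalized and mapped to an Indeed subdomain. `INDEED_DOMAINS` and `indeed_base_url` are hypothetical names for this example, and only a handful of the countries from the table are listed; the library's own lookup is `Country.from_string` in this PR.

```python
# Hypothetical subset of country -> Indeed subdomain pairs,
# copied from the Country enum added in this PR.
INDEED_DOMAINS = {
    "usa": "www",
    "uk": "uk",
    "germany": "de",
    "costa rica": "cr",
}


def indeed_base_url(country: str) -> str:
    """Normalize a country name and build the matching Indeed base URL."""
    key = country.strip().lower()
    if key not in INDEED_DOMAINS:
        valid = ", ".join(sorted(INDEED_DOMAINS))
        raise ValueError(
            f"Invalid country string: '{key}'. Valid countries are: {valid}"
        )
    return f"https://{INDEED_DOMAINS[key]}.indeed.com"
```

Stripping whitespace and lowercasing before the lookup means `"Germany"`, `" germany "` and `"GERMANY"` all resolve to the same entry, while a misspelled name fails loudly instead of silently falling back to the US site.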
 ## Frequently Asked Questions
 ---
 **Q: Encountering issues with your queries?**  
 **A:** Try reducing the number of `results_wanted` and/or broadening the filters. If problems persist, [submit an issue](#).
 ---
 **Q: Received a response code 429?**  
 **A:** This indicates that you have been blocked by the job board site for sending too many requests. Currently, **ZipRecruiter** is particularly aggressive with blocking. We recommend:
 - Waiting a few seconds between requests.
 - Trying a VPN to change your IP address.
 **Note:** Proxy support is in development and coming soon!
 ---
 ### FAQ
 #### Encountering issues with your queries?
 Try reducing the number of `results_wanted` and/or broadening the filters. If problems persist, please submit an issue.
 #### Received a response code 429?
 You have been blocked by the job board site for sending too many requests. ZipRecruiter seems to be the most aggressive at the moment. Consider waiting a few seconds, or try using a VPN. Proxy support coming soon.
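The "wait a few seconds" advice for 429 responses can be sketched as a retry loop with exponential backoff. This is only an illustration under stated assumptions: `fetch_with_backoff` is a hypothetical helper, `fetch` stands in for whatever call returns an HTTP status code, and none of this is part of the library.

```python
import time
from typing import Callable


def fetch_with_backoff(
    fetch: Callable[[], int],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch() until it returns something other than 429 or attempts run out.

    The delay doubles after every blocked attempt: base_delay, 2x, 4x, ...
    """
    status = fetch()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        sleep(base_delay * 2 ** (attempt - 1))
        status = fetch()
    return status
```

The `sleep` parameter is injectable so the loop can be exercised without actually waiting. If the site keeps answering 429 after `max_attempts` calls, the last status is returned and the caller decides what to do next.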
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "python-jobspy"
-version = "1.0.1"
+version = "1.1.1"
 description = "Job scraper for LinkedIn, Indeed & ZipRecruiter"
 authors = ["Zachary Hampton <zachary@zacharysproducts.com>", "Cullen Watson <cullen@cullen.ai>"]
 readme = "README.md"
@@ -24,4 +24,4 @@ jupyter = "^1.0.0"
 [build-system]
 requires = ["poetry-core"]
-build-backend = "poetry.core.masonry.api"
+build-backend = "poetry.core.masonry.api"
--- a/src/jobspy/core/init.py
+++ b/src/jobspy/core/init.py
--- a/src/jobspy/init.py
+++ b/src/jobspy/init.py
@@ -1,16 +1,13 @@
 import pandas as pd
-from typing import List, Tuple
+import concurrent.futures
 from concurrent.futures import ThreadPoolExecutor
 from typing import List, Tuple, NamedTuple, Dict
-from .jobs import JobType
+from .jobs import JobType, Location
 from .scrapers.indeed import IndeedScraper
 from .scrapers.ziprecruiter import ZipRecruiterScraper
 from .scrapers.linkedin import LinkedInScraper
-from .scrapers import (
-    ScraperInput,
-    Site,
-    JobResponse,
-)
+from .scrapers import ScraperInput, Site, JobResponse, Country
 SCRAPER_MAPPING = {
    Site.LINKEDIN: LinkedInScraper,
@@ -19,21 +16,27 @@ SCRAPER_MAPPING = {
 }
 class ScrapeResults(NamedTuple):
    jobs: pd.DataFrame
    errors: pd.DataFrame
 def _map_str_to_site(site_name: str) -> Site:
    return Site[site_name.upper()]
 def scrape_jobs(
-        site_name: str | Site | List[Site],
-        search_term: str,
-
-        location: str = "",
-        distance: int = None,
-        is_remote: bool = False,
-        job_type: JobType = None,
-        easy_apply: bool = False,  # linkedin
-        results_wanted: int = 15
-) -> pd.DataFrame:
+    site_name: str | Site | List[Site],
+    search_term: str,
+    location: str = "",
+    distance: int = None,
+    is_remote: bool = False,
+    job_type: JobType = None,
+    easy_apply: bool = False,  # linkedin
+    results_wanted: int = 15,
+    country_indeed: str = "usa",
+    hyperlinks: bool = False
+) -> ScrapeResults:
    """
    Scrapes job data from multiple job sites concurrently using a thread pool.
    :return: ScrapeResults with a `jobs` DataFrame and an `errors` DataFrame
@@ -42,9 +45,12 @@ def scrape_jobs(
    if type(site_name) == str:
        site_name = _map_str_to_site(site_name)
    country_enum = Country.from_string(country_indeed)
    site_type = [site_name] if type(site_name) == Site else site_name
    scraper_input = ScraperInput(
        site_type=site_type,
        country=country_enum,
        search_term=search_term,
        location=location,
        distance=distance,
@@ -55,64 +61,100 @@ def scrape_jobs(
    )
    def scrape_site(site: Site) -> Tuple[str, JobResponse]:
-        scraper_class = SCRAPER_MAPPING[site]
-        scraper = scraper_class()
-        scraped_data: JobResponse = scraper.scrape(scraper_input)
-
+        try:
+            scraper_class = SCRAPER_MAPPING[site]
+            scraper = scraper_class()
+            scraped_data: JobResponse = scraper.scrape(scraper_input)
+        except Exception as e:
+            scraped_data = JobResponse(jobs=[], error=str(e), success=False)
         return site.value, scraped_data
-    results = {}
+    results, errors = {}, {}
-    for site in scraper_input.site_type:
+
    def worker(site):
        site_value, scraped_data = scrape_site(site)
-        results[site_value] = scraped_data
+        return site_value, scraped_data
    with ThreadPoolExecutor() as executor:
        future_to_site = {executor.submit(worker, site): site for site in scraper_input.site_type}
        for future in concurrent.futures.as_completed(future_to_site):
            site_value, scraped_data = future.result()
            results[site_value] = scraped_data
            if scraped_data.error:
                errors[site_value] = scraped_data.error
    dfs = []
    for site, job_response in results.items():
        for job in job_response.jobs:
            data = job.dict()
-            data['site'] = site
-
-            # Formatting JobType
-            data['job_type'] = data['job_type'].value if data['job_type'] else None
-
-            # Formatting Location
-            location_obj = data.get('location')
-            if location_obj and isinstance(location_obj, dict):
-                data['city'] = location_obj.get('city', '')
-                data['state'] = location_obj.get('state', '')
-                data['country'] = location_obj.get('country', 'USA')
+            data["job_url_hyper"] = f'<a href="{data["job_url"]}">{data["job_url"]}</a>'
+            data["site"] = site
+            data["company"] = data["company_name"]
+            if data["job_type"]:
+                # Take the first value from the job type tuple
+                data["job_type"] = data["job_type"].value[0]
             else:
-                data['city'] = None
-                data['state'] = None
-                data['country'] = None
-            # Formatting Compensation
-            compensation_obj = data.get('compensation')
+                data["job_type"] = None
+            data["location"] = Location(**data["location"]).display_location()
+
+            compensation_obj = data.get("compensation")
             if compensation_obj and isinstance(compensation_obj, dict):
-                data['interval'] = compensation_obj.get('interval').value if compensation_obj.get('interval') else None
-                data['min_amount'] = compensation_obj.get('min_amount')
-                data['max_amount'] = compensation_obj.get('max_amount')
-                data['currency'] = compensation_obj.get('currency', 'USD')
+                data["interval"] = (
+                    compensation_obj.get("interval").value
+                    if compensation_obj.get("interval")
+                    else None
+                )
+                data["min_amount"] = compensation_obj.get("min_amount")
+                data["max_amount"] = compensation_obj.get("max_amount")
+                data["currency"] = compensation_obj.get("currency", "USD")
             else:
-                data['interval'] = None
-                data['min_amount'] = None
-                data['max_amount'] = None
-                data['currency'] = None
+                data["interval"] = None
+                data["min_amount"] = None
+                data["max_amount"] = None
+                data["currency"] = None
            job_df = pd.DataFrame([data])
            dfs.append(job_df)
    errors_list = [(key, value) for key, value in errors.items()]
    errors_df = pd.DataFrame(errors_list, columns=["Site", "Error"])
    if dfs:
        df = pd.concat(dfs, ignore_index=True)
-        desired_order = ['site', 'title', 'company_name', 'city', 'state','job_type',
+        if hyperlinks:
-                         'interval', 'min_amount', 'max_amount',  'job_url', 'description',]
+            desired_order = [
                "site",
                "title",
                "company",
                "location",
                "job_type",
                "interval",
                "min_amount",
                "max_amount",
                "currency",
                "job_url_hyper",
                "description",
            ]
        else:
            desired_order = [
                "site",
                "title",
                "company",
                "location",
                "job_type",
                "interval",
                "min_amount",
                "max_amount",
                "currency",
                "job_url",
                "description",
            ]
        df = df[desired_order]
    else:
        df = pd.DataFrame()
-    return df
+    return ScrapeResults(jobs=df, errors=errors_df)
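Review note: the threading change in `scrape_jobs` follows the usual submit-then-`as_completed` pattern. The sketch below isolates that pattern with hypothetical names (`run_all`, `tasks`) and plain callables in place of scrapers. It is a simplified illustration, not the PR's code: here a failed task is recorded only in `errors`, whereas the PR also keeps the failed `JobResponse` in `results`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(tasks):
    """Run every zero-argument callable in `tasks` (name -> callable) on a thread pool.

    Returns (results, errors): results maps name -> return value,
    errors maps name -> error message for tasks that raised.
    """
    results, errors = {}, {}

    def worker(name):
        # Catch inside the worker so one failing task cannot abort the others.
        try:
            return name, tasks[name](), None
        except Exception as exc:
            return name, None, str(exc)

    with ThreadPoolExecutor() as executor:
        futures = [executor.submit(worker, name) for name in tasks]
        # as_completed yields futures in finish order, not submission order.
        for future in as_completed(futures):
            name, value, error = future.result()
            if error is None:
                results[name] = value
            else:
                errors[name] = error
    return results, errors
```

Because the worker returns its own name alongside the value, the collecting loop does not need the `future_to_site` mapping to know which site a result belongs to.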
--- a/src/jobspy/jobs/init.py
+++ b/src/jobspy/jobs/init.py
@@ -6,25 +6,160 @@ from pydantic import BaseModel, validator
 class JobType(Enum):
-    FULL_TIME = "fulltime"
-    PART_TIME = "parttime"
-    CONTRACT = "contract"
-    TEMPORARY = "temporary"
-    INTERNSHIP = "internship"
+    FULL_TIME = (
+        "fulltime",
+        "períodointegral",
+        "estágio/trainee",
+        "cunormăîntreagă",
+        "tiempocompleto",
+        "vollzeit",
+        "voltijds",
+        "tempointegral",
+        "全职",
+        "plnýúvazek",
+        "fuldtid",
+        "دوامكامل",
+        "kokopäivätyö",
+        "tempsplein",
+        "vollzeit",
+        "πλήρηςαπασχόληση",
+        "teljesmunkaidő",
+        "tempopieno",
+        "tempsplein",
+        "heltid",
+        "jornadacompleta",
+        "pełnyetat",
+        "정규직",
+        "100%",
+        "全職",
+        "งานประจำ",
+        "tamzamanlı",
+        "повназайнятість",
+        "toànthờigian",
+    )
+    PART_TIME = ("parttime", "teilzeit")
+    CONTRACT = ("contract", "contractor")
+    TEMPORARY = ("temporary",)
+    INTERNSHIP = ("internship", "prácticas", "ojt(onthejobtraining)", "praktikum")
-    PER_DIEM = "perdiem"
+    PER_DIEM = ("perdiem",)
-    NIGHTS = "nights"
+    NIGHTS = ("nights",)
-    OTHER = "other"
+    OTHER = ("other",)
-    SUMMER = "summer"
+    SUMMER = ("summer",)
-    VOLUNTEER = "volunteer"
+    VOLUNTEER = ("volunteer",)
 class Country(Enum):
    ARGENTINA = ("argentina", "ar")
    AUSTRALIA = ("australia", "au")
    AUSTRIA = ("austria", "at")
    BAHRAIN = ("bahrain", "bh")
    BELGIUM = ("belgium", "be")
    BRAZIL = ("brazil", "br")
    CANADA = ("canada", "ca")
    CHILE = ("chile", "cl")
    CHINA = ("china", "cn")
    COLOMBIA = ("colombia", "co")
    COSTARICA = ("costa rica", "cr")
    CZECHREPUBLIC = ("czech republic", "cz")
    DENMARK = ("denmark", "dk")
    ECUADOR = ("ecuador", "ec")
    EGYPT = ("egypt", "eg")
    FINLAND = ("finland", "fi")
    FRANCE = ("france", "fr")
    GERMANY = ("germany", "de")
    GREECE = ("greece", "gr")
    HONGKONG = ("hong kong", "hk")
    HUNGARY = ("hungary", "hu")
    INDIA = ("india", "in")
    INDONESIA = ("indonesia", "id")
    IRELAND = ("ireland", "ie")
    ISRAEL = ("israel", "il")
    ITALY = ("italy", "it")
    JAPAN = ("japan", "jp")
    KUWAIT = ("kuwait", "kw")
    LUXEMBOURG = ("luxembourg", "lu")
    MALAYSIA = ("malaysia", "malaysia")
    MEXICO = ("mexico", "mx")
    MOROCCO = ("morocco", "ma")
    NETHERLANDS = ("netherlands", "nl")
    NEWZEALAND = ("new zealand", "nz")
    NIGERIA = ("nigeria", "ng")
    NORWAY = ("norway", "no")
    OMAN = ("oman", "om")
    PAKISTAN = ("pakistan", "pk")
    PANAMA = ("panama", "pa")
    PERU = ("peru", "pe")
    PHILIPPINES = ("philippines", "ph")
    POLAND = ("poland", "pl")
    PORTUGAL = ("portugal", "pt")
    QATAR = ("qatar", "qa")
    ROMANIA = ("romania", "ro")
    SAUDIARABIA = ("saudi arabia", "sa")
    SINGAPORE = ("singapore", "sg")
    SOUTHAFRICA = ("south africa", "za")
    SOUTHKOREA = ("south korea", "kr")
    SPAIN = ("spain", "es")
    SWEDEN = ("sweden", "se")
    SWITZERLAND = ("switzerland", "ch")
    TAIWAN = ("taiwan", "tw")
    THAILAND = ("thailand", "th")
    TURKEY = ("turkey", "tr")
    UKRAINE = ("ukraine", "ua")
    UNITEDARABEMIRATES = ("united arab emirates", "ae")
    UK = ("uk", "uk")
    USA = ("usa", "www")
    URUGUAY = ("uruguay", "uy")
    VENEZUELA = ("venezuela", "ve")
    VIETNAM = ("vietnam", "vn")
    # internal for ziprecruiter
    US_CANADA = ("usa/ca", "www")
    # internal for linkedin
    WORLDWIDE = ("worldwide", "www")
    def __new__(cls, country, domain):
        obj = object.__new__(cls)
        obj._value_ = country
        obj.domain = domain
        return obj
    @property
    def domain_value(self):
        return self.domain
    @classmethod
    def from_string(cls, country_str: str):
        """Convert a string to the corresponding Country enum."""
        country_str = country_str.strip().lower()
        for country in cls:
            if country.value == country_str:
                return country
        valid_countries = [country.value for country in cls]
        raise ValueError(
            f"Invalid country string: '{country_str}'. Valid countries (only include this param for Indeed) are: {', '.join(valid_countries)}"
        )
 class Location(BaseModel):
-    country: str = "USA"
+    country: Country = None
-    city: str = None
+    city: Optional[str] = None
    state: Optional[str] = None
    def display_location(self) -> str:
        location_parts = []
        if self.city:
            location_parts.append(self.city)
        if self.state:
            location_parts.append(self.state)
        if self.country and self.country not in (Country.US_CANADA, Country.WORLDWIDE):
            if self.country.value in ("usa", "uk"):
                location_parts.append(self.country.value.upper())
            else:
                location_parts.append(self.country.value.title())
        return ", ".join(location_parts)
 class CompensationInterval(Enum):
    YEARLY = "yearly"
@@ -38,7 +173,7 @@ class Compensation(BaseModel):
    interval: CompensationInterval
    min_amount: int = None
    max_amount: int = None
-    currency: str = "USD"
+    currency: Optional[str] = "USD"
 class JobPost(BaseModel):
@@ -47,10 +182,10 @@ class JobPost(BaseModel):
    job_url: str
    location: Optional[Location]
-    description: str = None
+    description: Optional[str] = None
    job_type: Optional[JobType] = None
    compensation: Optional[Compensation] = None
-    date_posted: date = None
+    date_posted: Optional[date] = None
 class JobResponse(BaseModel):
--- a/src/jobspy/scrapers/init.py
+++ b/src/jobspy/scrapers/init.py
@@ -1,5 +1,5 @@
-from ..jobs import Enum, BaseModel, JobType, JobResponse
+from ..jobs import Enum, BaseModel, JobType, JobResponse, Country
-from typing import List, Dict, Optional, Any
+from typing import List, Optional, Any
 class StatusException(Exception):
@@ -18,6 +18,7 @@ class ScraperInput(BaseModel):
    search_term: str
    location: str = None
    country: Optional[Country] = Country.USA
    distance: Optional[int] = None
    is_remote: bool = False
    job_type: Optional[JobType] = None
@@ -26,18 +27,9 @@ class ScraperInput(BaseModel):
    results_wanted: int = 15
-class CommonResponse(BaseModel):
-    status: Optional[str]
-    error: Optional[str]
-    linkedin: Optional[Any] = None
-    indeed: Optional[Any] = None
-    zip_recruiter: Optional[Any] = None
 class Scraper:
-    def __init__(self, site: Site, url: str):
+    def __init__(self, site: Site):
        self.site = site
-        self.url = url
    def scrape(self, scraper_input: ScraperInput) -> JobResponse:
        ...
--- a/src/jobspy/scrapers/indeed/init.py
+++ b/src/jobspy/scrapers/indeed/init.py
@@ -1,9 +1,10 @@
 import re
 import sys
 import math
 import io
 import json
 import traceback
 from datetime import datetime
-from typing import Optional, Tuple, List
+from typing import Optional
 import tls_client
 import urllib.parse
@@ -11,8 +12,15 @@ from bs4 import BeautifulSoup
 from bs4.element import Tag
 from concurrent.futures import ThreadPoolExecutor, Future
-from ...jobs import JobPost, Compensation, CompensationInterval, Location, JobResponse, JobType
+from ...jobs import (
-from .. import Scraper, ScraperInput, Site, StatusException
+    JobPost,
    Compensation,
    CompensationInterval,
    Location,
    JobResponse,
    JobType,
 )
 from .. import Scraper, ScraperInput, Site, Country, StatusException
 class ParsingException(Exception):
@@ -25,8 +33,7 @@ class IndeedScraper(Scraper):
        Initializes IndeedScraper with the Indeed job search url
        """
        site = Site(Site.INDEED)
-        url = "https://www.indeed.com"
-        super().__init__(site, url)
+        super().__init__(site)
        self.jobs_per_page = 15
        self.seen_urls = set()
@@ -41,16 +48,21 @@ class IndeedScraper(Scraper):
        :param session:
        :return: jobs found on page, total number of jobs found for search
        """
        self.country = scraper_input.country
        domain = self.country.domain_value
        self.url = f"https://{domain}.indeed.com"
        job_list = []
        params = {
            "q": scraper_input.search_term,
            "l": scraper_input.location,
            "radius": scraper_input.distance,
            "filter": 0,
            "start": 0 + page * 10,
        }
        if scraper_input.distance:
            params["radius"] = scraper_input.distance
        sc_values = []
        if scraper_input.is_remote:
            sc_values.append("attr(DSQF7)")
@@ -59,15 +71,15 @@ class IndeedScraper(Scraper):
        if sc_values:
            params["sc"] = "0kf:" + "".join(sc_values) + ";"
-        response = session.get(self.url + "/jobs", params=params)
+        response = session.get(self.url + "/jobs", params=params, allow_redirects=True)
        # print(response.status_code)
-        if (
-            response.status_code != 200
-            and response.status_code != 307
-        ):
+        if response.status_code not in range(200, 400):
            raise StatusException(response.status_code)
        soup = BeautifulSoup(response.content, "html.parser")
        with open("text2.html", "w", encoding="utf-8") as f:
            f.write(str(soup))
        if "did not match any jobs" in str(soup):
            raise ParsingException("Search did not match any jobs")
@@ -89,8 +101,6 @@ class IndeedScraper(Scraper):
            if job_url in self.seen_urls:
                return None
            snippet_html = BeautifulSoup(job["snippet"], "html.parser")
            extracted_salary = job.get("extractedSalary")
            compensation = None
            if extracted_salary:
@@ -115,11 +125,12 @@ class IndeedScraper(Scraper):
            date_posted = date_posted.strftime("%Y-%m-%d")
            description = self.get_description(job_url, session)
-            li_elements = snippet_html.find_all("li")
-            if description is None and li_elements:
-                description = " ".join(li.text for li in li_elements)
+            with io.StringIO(job["snippet"]) as f:
+                soup = BeautifulSoup(f, "html.parser")
+                li_elements = soup.find_all("li")
+                if description is None and li_elements:
+                    description = " ".join(li.text for li in li_elements)
            first_li = snippet_html.find("li")
            job_post = JobPost(
                title=job["normTitle"],
                description=description,
@@ -127,6 +138,7 @@ class IndeedScraper(Scraper):
                location=Location(
                    city=job.get("jobLocationCity"),
                    state=job.get("jobLocationState"),
                    country=self.country,
                ),
                job_type=job_type,
                compensation=compensation,
@@ -135,9 +147,11 @@ class IndeedScraper(Scraper):
            )
            return job_post
-        with ThreadPoolExecutor(max_workers=10) as executor:
+        with ThreadPoolExecutor(max_workers=1) as executor:
-            job_results: list[Future] = [executor.submit(process_job, job) for job in
+            job_results: list[Future] = [
-                                         jobs["metaData"]["mosaicProviderJobCardsModel"]["results"]]
+                executor.submit(process_job, job)
                for job in jobs["metaData"]["mosaicProviderJobCardsModel"]["results"]
            ]
        job_list = [result.result() for result in job_results if result.result()]
@@ -161,7 +175,7 @@ class IndeedScraper(Scraper):
            #: get first page to initialize session
            job_list, total_results = self.scrape_page(scraper_input, 0, session)
-            with ThreadPoolExecutor(max_workers=10) as executor:
+            with ThreadPoolExecutor(max_workers=1) as executor:
                futures: list[Future] = [
                    executor.submit(self.scrape_page, scraper_input, page, session)
                    for page in range(1, pages_to_process + 1)
@@ -210,7 +224,12 @@ class IndeedScraper(Scraper):
        jk_value = params.get("jk", [None])[0]
        formatted_url = f"{self.url}/viewjob?jk={jk_value}&spa=1"
-        response = session.get(formatted_url, allow_redirects=True)
+        try:
+            response = session.get(
+                formatted_url, allow_redirects=True, timeout_seconds=5
+            )
+        except requests.exceptions.Timeout:
+            return None
        if response.status_code not in range(200, 400):
            return None
@@ -218,9 +237,10 @@ class IndeedScraper(Scraper):
        raw_description = response.json()["body"]["jobInfoWrapperModel"][
            "jobInfoModel"
        ]["sanitizedJobDescription"]
-        soup = BeautifulSoup(raw_description, "html.parser")
-        text_content = " ".join(soup.get_text().split()).strip()
-        return text_content
+        with io.StringIO(raw_description) as f:
+            soup = BeautifulSoup(f, "html.parser")
+            text_content = " ".join(soup.get_text().split()).strip()
+            return text_content
    @staticmethod
    def get_job_type(job: dict) -> Optional[JobType]:
@@ -232,13 +252,18 @@ class IndeedScraper(Scraper):
        for taxonomy in job["taxonomyAttributes"]:
            if taxonomy["label"] == "job-types":
                if len(taxonomy["attributes"]) > 0:
-                    job_type_str = (
-                        taxonomy["attributes"][0]["label"]
-                        .replace("-", "_")
-                        .replace(" ", "_")
-                        .upper()
-                    )
-                    return JobType[job_type_str]
+                    label = taxonomy["attributes"][0].get("label")
+                    if label:
+                        job_type_str = label.replace("-", "").replace(" ", "").lower()
+                        # print(f"Debug: job_type_str = {job_type_str}")
+                        return IndeedScraper.get_enum_from_value(job_type_str)
+        return None
+
+    @staticmethod
+    def get_enum_from_value(value_str):
+        for job_type in JobType:
+            if value_str in job_type.value:
+                return job_type
+        return None
    @staticmethod
@@ -289,7 +314,7 @@ class IndeedScraper(Scraper):
        :param soup:
        :return: total_num_jobs
        """
-        script = soup.find("script", string=lambda t: "window._initialData" in t)
+        script = soup.find("script", string=lambda t: t and "window._initialData" in t)
        pattern = re.compile(r"window._initialData\s*=\s*({.*})\s*;", re.DOTALL)
        match = pattern.search(script.string)
--- a/src/jobspy/scrapers/linkedin/init.py
+++ b/src/jobspy/scrapers/linkedin/init.py
@@ -1,12 +1,21 @@
 from typing import Optional, Tuple
 from datetime import datetime
 import traceback
 import requests
 from requests.exceptions import Timeout
 from bs4 import BeautifulSoup
 from bs4.element import Tag
 from .. import Scraper, ScraperInput, Site
-from ...jobs import JobPost, Location, JobResponse, JobType, Compensation, CompensationInterval
+from ...jobs import (
    JobPost,
    Location,
    JobResponse,
    JobType,
    Compensation,
    CompensationInterval,
 )
 class LinkedInScraper(Scraper):
@@ -15,8 +24,8 @@ class LinkedInScraper(Scraper):
        Initializes LinkedInScraper with the LinkedIn job search url
        """
        site = Site(Site.LINKEDIN)
-        url = "https://www.linkedin.com"
+        self.url = "https://www.linkedin.com"
-        super().__init__(site, url)
+        super().__init__(site)
    def scrape(self, scraper_input: ScraperInput) -> JobResponse:
        """
@@ -24,6 +33,7 @@ class LinkedInScraper(Scraper):
        :param scraper_input:
        :return: job_response
        """
        self.country = "worldwide"
        job_list: list[JobPost] = []
        seen_urls = set()
        page, processed_jobs, job_count = 0, 0, 0
@@ -59,9 +69,12 @@ class LinkedInScraper(Scraper):
                )
                if response.status_code != 200:
                    reason = ' (too many requests)' if response.status_code == 429 else ''
                    return JobResponse(
                        success=False,
-                        error=f"Response returned {response.status_code}",
+                        error=f"LinkedIn returned {response.status_code} {reason}",
                        jobs=job_list,
                        total_results=job_count,
                    )
                soup = BeautifulSoup(response.text, "html.parser")
@@ -97,7 +110,7 @@ class LinkedInScraper(Scraper):
                    metadata_card = job_info.find(
                        "div", class_="base-search-card__metadata"
                    )
-                    location: Location = LinkedInScraper.get_location(metadata_card)
+                    location: Location = self.get_location(metadata_card)
                    datetime_tag = metadata_card.find(
                        "time", class_="job-search-card__listdate"
@@ -105,7 +118,10 @@ class LinkedInScraper(Scraper):
                    description, job_type = LinkedInScraper.get_description(job_url)
                    if datetime_tag:
                        datetime_str = datetime_tag["datetime"]
-                        date_posted = datetime.strptime(datetime_str, "%Y-%m-%d")
+                        try:
                            date_posted = datetime.strptime(datetime_str, "%Y-%m-%d")
                        except Exception as e:
                            date_posted = None
                    else:
                        date_posted = None
@@ -117,18 +133,18 @@ class LinkedInScraper(Scraper):
                        date_posted=date_posted,
                        job_url=job_url,
                        job_type=job_type,
-                        compensation=Compensation(interval=CompensationInterval.YEARLY, currency="USD")
+                        compensation=Compensation(
+                            interval=CompensationInterval.YEARLY, currency=None
+                        ),
                    )
                    job_list.append(job_post)
-                    if (
-                        len(job_list) >= scraper_input.results_wanted
-                        or processed_jobs >= job_count
-                    ):
+                    if processed_jobs >= job_count:
+                        break
+                    if len(job_list) >= scraper_input.results_wanted:
                         break
-                if (
-                    len(job_list) >= scraper_input.results_wanted
-                    or processed_jobs >= job_count
-                ):
+                if processed_jobs >= job_count:
+                    break
+                if len(job_list) >= scraper_input.results_wanted:
                     break
                page += 1
@@ -148,7 +164,11 @@ class LinkedInScraper(Scraper):
        :param job_page_url:
        :return: description or None
        """
-        response = requests.get(job_page_url, allow_redirects=True)
+        try:
+            response = requests.get(job_page_url, timeout=5)
+        except Timeout:
+            return None, None
        if response.status_code not in range(200, 400):
            return None, None
@@ -186,17 +206,24 @@ class LinkedInScraper(Scraper):
                    employment_type = employment_type.lower()
                    employment_type = employment_type.replace("-", "")
-            return JobType(employment_type)
+            return LinkedInScraper.get_enum_from_value(employment_type)
        return text_content, get_job_type(soup)
    @staticmethod
-    def get_location(metadata_card: Optional[Tag]) -> Location:
+    def get_enum_from_value(value_str):
+        for job_type in JobType:
+            if value_str in job_type.value:
+                return job_type
+        return None
+    def get_location(self, metadata_card: Optional[Tag]) -> Location:
        """
        Extracts the location data from the job metadata card.
        :param metadata_card
        :return: location
        """
        location = Location(country=self.country)
        if metadata_card is not None:
            location_tag = metadata_card.find(
                "span", class_="job-search-card__location"
@@ -208,6 +235,7 @@ class LinkedInScraper(Scraper):
                location = Location(
                    city=city,
                    state=state,
+                    country=self.country,
                )
        return location
--- a/src/jobspy/scrapers/ziprecruiter/__init__.py
+++ b/src/jobspy/scrapers/ziprecruiter/__init__.py
@@ -1,8 +1,9 @@
 import math
 import json
 import re
 import traceback
 from datetime import datetime
-from typing import Optional, Tuple, List
+from typing import Optional, Tuple
 from urllib.parse import urlparse, parse_qs
 import tls_client
@@ -11,7 +12,15 @@ from bs4.element import Tag
 from concurrent.futures import ThreadPoolExecutor, Future
 from .. import Scraper, ScraperInput, Site, StatusException
-from ...jobs import JobPost, Compensation, CompensationInterval, Location, JobResponse, JobType
+from ...jobs import (
+    JobPost,
+    Compensation,
+    CompensationInterval,
+    Location,
+    JobResponse,
+    JobType,
+    Country,
+)
 class ZipRecruiterScraper(Scraper):
@@ -20,8 +29,8 @@ class ZipRecruiterScraper(Scraper):
        Initializes LinkedInScraper with the ZipRecruiter job search url
        """
        site = Site(Site.ZIP_RECRUITER)
-        url = "https://www.ziprecruiter.com"
+        self.url = "https://www.ziprecruiter.com"
-        super().__init__(site, url)
+        super().__init__(site)
        self.jobs_per_page = 20
        self.seen_urls = set()
@@ -55,7 +64,7 @@ class ZipRecruiterScraper(Scraper):
            "search": scraper_input.search_term,
            "location": scraper_input.location,
            "page": page,
-            "form": "jobs-landing"
+            "form": "jobs-landing",
        }
        if scraper_input.is_remote:
@@ -65,14 +74,18 @@ class ZipRecruiterScraper(Scraper):
            params["radius"] = scraper_input.distance
        if job_type_value:
-            params["refine_by_employment"] = f"employment_type:employment_type:{job_type_value}"
+            params[
+                "refine_by_employment"
+            ] = f"employment_type:employment_type:{job_type_value}"
        response = self.session.get(
            self.url + "/jobs-search",
            headers=ZipRecruiterScraper.headers(),
            params=params,
            allow_redirects=True,
        )
        # print(response.status_code)
        if response.status_code != 200:
            raise StatusException(response.status_code)
@@ -90,11 +103,14 @@ class ZipRecruiterScraper(Scraper):
        with ThreadPoolExecutor(max_workers=10) as executor:
            if "jobList" in data and data["jobList"]:
                jobs_js = data["jobList"]
-                job_results = [executor.submit(self.process_job_js, job) for job in jobs_js]
+                job_results = [
+                    executor.submit(self.process_job_js, job) for job in jobs_js
+                ]
            else:
                jobs_html = soup.find_all("div", {"class": "job_content"})
-                job_results = [executor.submit(self.process_job_html, job) for job in
-                               jobs_html]
+                job_results = [
+                    executor.submit(self.process_job_html, job) for job in jobs_html
+                ]
        job_list = [result.result() for result in job_results if result.result()]
@@ -107,8 +123,9 @@ class ZipRecruiterScraper(Scraper):
        :return: job_response
        """
-
-        pages_to_process = max(3, math.ceil(scraper_input.results_wanted / self.jobs_per_page))
+        pages_to_process = max(
+            3, math.ceil(scraper_input.results_wanted / self.jobs_per_page)
+        )
        try:
            #: get first page to initialize session
@@ -125,7 +142,6 @@ class ZipRecruiterScraper(Scraper):
                    job_list += jobs
        except StatusException as e:
            return JobResponse(
                success=False,
@@ -162,27 +178,19 @@ class ZipRecruiterScraper(Scraper):
        title = job.find("h2", {"class": "title"}).text
        company = job.find("a", {"class": "company_name"}).text.strip()
-        description, updated_job_url = self.get_description(
-            job_url
-        )
+        description, updated_job_url = self.get_description(job_url)
        if updated_job_url is not None:
            job_url = updated_job_url
        if description is None:
            description = job.find("p", {"class": "job_snippet"}).text.strip()
        job_type_element = job.find("li", {"class": "perk_item perk_type"})
        job_type = None
        if job_type_element:
            job_type_text = (
-                job_type_element.text.strip()
-                .lower()
-                .replace("-", "")
-                .replace(" ", "")
+                job_type_element.text.strip().lower().replace("-", "").replace(" ", "")
            )
-            if job_type_text == "contractor":
-                job_type_text = "contract"
-            job_type = JobType(job_type_text)
+            job_type = ZipRecruiterScraper.get_job_type_enum(job_type_text)
        else:
            job_type = None
        date_posted = ZipRecruiterScraper.get_date_posted(job)
@@ -199,14 +207,19 @@ class ZipRecruiterScraper(Scraper):
        return job_post
    def process_job_js(self, job: dict) -> JobPost:
        # Map the job data to the expected fields by the Pydantic model
        title = job.get("Title")
-        description = BeautifulSoup(job.get("Snippet","").strip(), "html.parser").get_text()
+        description = BeautifulSoup(
+            job.get("Snippet", "").strip(), "html.parser"
+        ).get_text()
        company = job.get("OrgName")
-        location = Location(city=job.get("City"), state=job.get("State"))
+        location = Location(
+            city=job.get("City"), state=job.get("State"), country=Country.US_CANADA
+        )
        try:
-            job_type = ZipRecruiterScraper.job_type_from_string(job.get("EmploymentType", "").replace("-", "_").lower())
+            job_type = ZipRecruiterScraper.get_job_type_enum(
+                job.get("EmploymentType", "").replace("-", "_").lower()
+            )
        except ValueError:
            # print(f"Skipping job due to unrecognized job type: {job.get('EmploymentType')}")
            return None
@@ -215,14 +228,14 @@ class ZipRecruiterScraper(Scraper):
        salary_parts = formatted_salary.split(" ")
        min_salary_str = salary_parts[0][1:].replace(",", "")
-        if '.' in min_salary_str:
+        if "." in min_salary_str:
            min_amount = int(float(min_salary_str) * 1000)
        else:
            min_amount = int(min_salary_str.replace("K", "000"))
        if len(salary_parts) >= 3 and salary_parts[2].startswith("$"):
            max_salary_str = salary_parts[2][1:].replace(",", "")
-            if '.' in max_salary_str:
+            if "." in max_salary_str:
                max_amount = int(float(max_salary_str) * 1000)
            else:
                max_amount = int(max_salary_str.replace("K", "000"))
@@ -232,10 +245,13 @@ class ZipRecruiterScraper(Scraper):
        compensation = Compensation(
            interval=CompensationInterval.YEARLY,
            min_amount=min_amount,
-            max_amount=max_amount
+            max_amount=max_amount,
+            currency="USD/CAD",
        )
        save_job_url = job.get("SaveJobURL", "")
-        posted_time_match = re.search(r"posted_time=(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)", save_job_url)
+        posted_time_match = re.search(
+            r"posted_time=(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)", save_job_url
+        )
        if posted_time_match:
            date_time_str = posted_time_match.group(1)
            date_posted_obj = datetime.strptime(date_time_str, "%Y-%m-%dT%H:%M:%SZ")
@@ -257,33 +273,35 @@ class ZipRecruiterScraper(Scraper):
        return job_post
    @staticmethod
-    def job_type_from_string(value: str) -> Optional[JobType]:
-        if not value:
-            return None
+    def get_enum_from_value(value_str):
+        for job_type in JobType:
+            if value_str in job_type.value:
+                return job_type
+        return None
-        if value.lower() == "contractor":
-            value = "contract"
-        normalized_value = value.replace("_", "")
-        for item in JobType:
-            if item.value == normalized_value:
-                return item
+    @staticmethod
+    def get_job_type_enum(job_type_str: str) -> Optional[JobType]:
+        for job_type in JobType:
+            if job_type_str in job_type.value:
+                return job_type
+        return None
        raise ValueError(f"Invalid value for JobType: {value}")
-    def get_description(
-            self,
-        job_page_url: str
-    ) -> Tuple[Optional[str], Optional[str]]:
+    def get_description(self, job_page_url: str) -> Tuple[Optional[str], Optional[str]]:
        """
        Retrieves job description by going to the job page url
        :param job_page_url:
        :param session:
        :return: description or None, response url
        """
-        response = self.session.get(
-            job_page_url, headers=ZipRecruiterScraper.headers(), allow_redirects=True
-        )
-        if response.status_code not in range(200, 400):
-            return None, None
+        try:
+            response = self.session.get(
+                job_page_url,
+                headers=ZipRecruiterScraper.headers(),
+                allow_redirects=True,
+                timeout_seconds=5,
+            )
+        except requests.exceptions.Timeout:
+            return None
        html_string = response.content
        soup_job = BeautifulSoup(html_string, "html.parser")
@@ -365,7 +383,10 @@ class ZipRecruiterScraper(Scraper):
                amounts.append(amount)
            compensation = Compensation(
-                interval=interval, min_amount=min(amounts), max_amount=max(amounts)
+                interval=interval,
+                min_amount=min(amounts),
+                max_amount=max(amounts),
+                currency="USD/CAD",
            )
            return compensation
@@ -389,10 +410,7 @@ class ZipRecruiterScraper(Scraper):
                city, state = None, None
        else:
            city, state = None, None
-        return Location(
-            city=city,
-            state=state,
-        )
+        return Location(city=city, state=state, country=Country.US_CANADA)
    @staticmethod
    def headers() -> dict:
--- a/src/tests/__init__.py
+++ b/src/tests/__init__.py
--- a/src/tests/test_indeed.py
+++ b/src/tests/test_indeed.py
@@ -1,4 +1,4 @@
-from jobspy import scrape_jobs
+from ..jobspy import scrape_jobs
 def test_indeed():
Author	SHA1	Message	Date
Cullen Watson	fd883178be	Thread sites (#40 )	2023-09-06 09:47:11 -05:00
Cullen Watson	70e2218c67	reduce size of jupyter notebook	2023-09-05 13:09:18 -05:00
Cullen Watson	d6947ecdd7	Update README.md	2023-09-05 13:03:32 -05:00
Cullen Watson	5191658562	Update README.md	2023-09-05 12:27:00 -05:00
Cullen Watson	1c264b8c58	Indeed country support (#38 )	2023-09-05 12:17:22 -05:00
Cullen Watson	1598d4ff63	update README.md	2023-09-04 22:58:46 -05:00
Cullen Watson	bf2460684b	update README.md	2023-09-04 22:52:21 -05:00
Cullen Watson	f5b1e95e64	version number	2023-09-03 20:05:54 -05:00
Cullen Watson	7ae7ecdee8	Validation error (#35 )	2023-09-03 20:05:31 -05:00