docs:readme

Increment version
2026-03-05 12:04:33 -08:00 · 2025-02-09 13:42:18 -06:00 · 2025-01-17 21:44:49 -06:00 · 2024-12-04 22:55:06 +00:00 · 2024-12-04 16:54:52 -06:00 · 2024-12-04 16:52:15 -06:00
8 changed files with 221 additions and 142 deletions
--- a/.github/workflows/publish-to-pypi.yml
+++ b/.github/workflows/publish-to-pypi.yml
@@ -1,33 +1,50 @@
 name: Publish Python 🐍 distributions 📦 to PyPI
-on: push
+on:
+  pull_request:
+    types:
+      - closed
+
+permissions:
+  contents: write

 jobs:
  build-n-publish:
    name: Build and publish Python 🐍 distributions 📦 to PyPI
    runs-on: ubuntu-latest

+    if: github.event.pull_request.merged == true && github.event.pull_request.base.ref == 'main'
+
    steps:
      - uses: actions/checkout@v3
+
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

+      - name: Install dependencies
+        run: pip install toml
+
+      - name: Increment version
+        run: python increment_version.py
+
+      - name: Commit version increment
+        run: |
+          git config --global user.name 'github-actions'
+          git config --global user.email 'github-actions@github.com'
+          git add pyproject.toml
+          git commit -m 'Increment version'
+
+      - name: Push changes
+        run: git push
+
      - name: Install poetry
-        run: >-
-          python3 -m
-          pip install
-          poetry
-          --user
+        run: pip install poetry --user

      - name: Build distribution 📦
-        run: >-
-          python3 -m
-          poetry
-          build
+        run: poetry build

      - name: Publish distribution 📦 to PyPI
-        if: startsWith(github.ref, 'refs/tags')
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
-          password: ${{ secrets.PYPI_API_TOKEN }}
+          password: ${{ secrets.PYPI_API_TOKEN }}
--- a/README.md
+++ b/README.md
@@ -2,16 +2,11 @@

 **JobSpy** is a simple, yet comprehensive, job scraping library.

-**Not technical?** Try out the web scraping tool on our site at [usejobspy.com](https://usejobspy.com).
-
-*Looking to build a data-focused software product?* **[Book a call](https://bunsly.com/)** *to
-work with us.*
-
 ## Features

 - Scrapes job postings from **LinkedIn**, **Indeed**, **Glassdoor**, **Google**, & **ZipRecruiter** simultaneously
- Aggregates the job postings in a Pandas DataFrame
- Proxies support
+- Aggregates the job postings in a dataframe
+- Proxies support to bypass blocking

 ![jobspy](https://github.com/cullenwatson/JobSpy/assets/78247585/ec7ef355-05f6-4fd3-8161-a817e31c5c57)

@@ -32,14 +27,14 @@ from jobspy import scrape_jobs
 jobs = scrape_jobs(
    site_name=["indeed", "linkedin", "zip_recruiter", "glassdoor", "google"],
    search_term="software engineer",
+    google_search_term="software engineer jobs near San Francisco, CA since yesterday",
    location="San Francisco, CA",
    results_wanted=20,
-    hours_old=72, # (only Linkedin/Indeed is hour specific, others round up to days old)
-    country_indeed='USA',  # only needed for indeed / glassdoor
+    hours_old=72,
+    country_indeed='USA',
    
-    # linkedin_fetch_description=True # get more info such as full description, direct job url for linkedin (slower)
+    # linkedin_fetch_description=True # gets more info such as description, direct job url (slower)
    # proxies=["208.195.175.46:65095", "208.195.175.45:65095", "localhost"],
-    
 )
 print(f"Found {len(jobs)} jobs")
 print(jobs.head())
@@ -63,10 +58,13 @@ zip_recruiter Software Developer                 TEKsystems        Phoenix
 ```plaintext
 Optional
 ├── site_name (list|str): 
-|    linkedin, zip_recruiter, indeed, glassdoor 
-|    (default is all four)
+|    linkedin, zip_recruiter, indeed, glassdoor, google
+|    (default is all)
 │
 ├── search_term (str)
+|
+├── google_search_term (str)
+|     search term for google jobs. This is the only param for filtering google jobs.
 │
 ├── location (str)
 │
@@ -86,7 +84,7 @@ Optional
 |    number of job results to retrieve for each site specified in 'site_name'
 │
 ├── easy_apply (bool): 
-|    filters for jobs that are hosted on the job board site
+|    filters for jobs that are hosted on the job board site (LinkedIn easy apply filter no longer works)
 │
 ├── description_format (str): 
 |    markdown, html (Format type of the job descriptions. Default is markdown.)
@@ -131,46 +129,6 @@ Optional
 |    - easy_apply
 ```

-
-### JobPost Schema
-
-```plaintext
-JobPost
-├── title
-├── company
-├── company_url
-├── job_url
-├── location
-│   ├── country
-│   ├── city
-│   ├── state
-├── description
-├── job_type: fulltime, parttime, internship, contract
-├── job_function
-│   ├── interval: yearly, monthly, weekly, daily, hourly
-│   ├── min_amount
-│   ├── max_amount
-│   ├── currency
-│   └── salary_source: direct_data, description (parsed from posting)
-├── date_posted
-├── emails
-└── is_remote
-
-Linkedin specific
-└── job_level
-
-Linkedin & Indeed specific
-└── company_industry
-
-Indeed specific
-├── company_country
-├── company_addresses
-├── company_employees_label
-├── company_revenue_label
-├── company_description
-└── company_logo
-```
-
 ## Supported Countries for Job Searching

 ### **LinkedIn**
@@ -217,7 +175,23 @@ You can specify the following countries when searching on Indeed (use the exact

 ---
 **Q: Why is Indeed giving unrelated roles?**  
-**A:** Indeed is searching each one of your terms e.g. software intern, it searches software OR intern. Try search_term='"software intern"' in quotes for stricter searching
+**A:** Indeed searches the description too.
+
+- use - to remove words
+- "" for exact match
+
+Example of a good Indeed query
+
+```py
+search_term='"engineering intern" software summer (java OR python OR c++) 2025 -tax -marketing'
+```
+
+This searches the description and title. Results must include software, summer, 2025, and one of the languages, must contain the exact phrase engineering intern, and must not contain tax or marketing.
+
+---
+
+**Q: No results when using "google"?**  
+**A:** Google requires very specific syntax. Search for Google Jobs in your browser, apply your filters, then copy whatever appears in the Google Jobs search box and paste it into `google_search_term`.

 ---

@@ -229,8 +203,41 @@ You can specify the following countries when searching on Indeed (use the exact

 ---

-**Q: Encountering issues with your queries?**  
-**A:** Try reducing the number of `results_wanted` and/or broadening the filters. If problems
-persist, [submit an issue](https://github.com/Bunsly/JobSpy/issues).
+### JobPost Schema

---
+```plaintext
+JobPost
+├── title
+├── company
+├── company_url
+├── job_url
+├── location
+│   ├── country
+│   ├── city
+│   ├── state
+├── description
+├── job_type: fulltime, parttime, internship, contract
+├── job_function
+│   ├── interval: yearly, monthly, weekly, daily, hourly
+│   ├── min_amount
+│   ├── max_amount
+│   ├── currency
+│   └── salary_source: direct_data, description (parsed from posting)
+├── date_posted
+├── emails
+└── is_remote
+
+Linkedin specific
+└── job_level
+
+Linkedin & Indeed specific
+└── company_industry
+
+Indeed specific
+├── company_country
+├── company_addresses
+├── company_employees_label
+├── company_revenue_label
+├── company_description
+└── company_logo
+```
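Review note on the README FAQ: the Indeed query syntax described there (quotes for an exact phrase, `-` to exclude a word, `OR` inside parentheses for alternatives) can be assembled programmatically. The following is a minimal sketch under those assumptions; `build_indeed_query` and its parameters are hypothetical names and are not part of JobSpy.

```python
from typing import Iterable


def build_indeed_query(
    exact: Iterable[str] = (),
    required: Iterable[str] = (),
    any_of: Iterable[str] = (),
    excluded: Iterable[str] = (),
) -> str:
    """Assemble a search_term string from the operators listed in the README FAQ.

    Hypothetical helper for illustration only; not part of JobSpy.
    """
    # Exact phrases are wrapped in double quotes
    parts = [f'"{phrase}"' for phrase in exact]
    # Plain words must all appear
    parts.extend(required)
    # Alternatives are grouped as (a OR b OR c)
    alternatives = list(any_of)
    if alternatives:
        parts.append("(" + " OR ".join(alternatives) + ")")
    # Excluded words are prefixed with a minus sign
    parts.extend(f"-{word}" for word in excluded)
    return " ".join(parts)


query = build_indeed_query(
    exact=["engineering intern"],
    required=["software", "summer"],
    any_of=["java", "python", "c++"],
    excluded=["tax", "marketing"],
)
print(query)
```

The resulting string would be passed as `search_term` to `scrape_jobs`. How Indeed actually interprets each operator is taken from the README text, not verified here.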
--- a/increment_version.py
+++ b/increment_version.py
@@ -0,0 +1,21 @@
+import toml
+
+def increment_version(version):
+    major, minor, patch = map(int, version.split('.'))
+    patch += 1
+    return f"{major}.{minor}.{patch}"
+
+# Load pyproject.toml
+with open('pyproject.toml', 'r') as file:
+    pyproject = toml.load(file)
+
+# Increment the version
+current_version = pyproject['tool']['poetry']['version']
+new_version = increment_version(current_version)
+pyproject['tool']['poetry']['version'] = new_version
+
+# Save the updated pyproject.toml
+with open('pyproject.toml', 'w') as file:
+    toml.dump(pyproject, file)
+
+print(f"Version updated from {current_version} to {new_version}")
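Review note: `increment_version` assumes the version string always splits into exactly three integers and fails with an unpacking or conversion error otherwise. A minimal sketch of a stricter variant follows, assuming plain `MAJOR.MINOR.PATCH` versions with no pre-release suffix; `bump_patch` is a hypothetical name and is not part of this change.

```python
import re

# Hypothetical helper, not part of this change: accepts only MAJOR.MINOR.PATCH
_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def bump_patch(version: str) -> str:
    """Return the version with its patch component incremented by one."""
    match = _VERSION_RE.fullmatch(version.strip())
    if match is None:
        # Reject anything that is not three dot-separated integers
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return f"{major}.{minor}.{patch + 1}"


if __name__ == "__main__":
    print(bump_patch("1.1.75"))
```

Raising a clear `ValueError` would make a failed workflow run easier to diagnose than a bare unpacking error.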
--- a/poetry.toml
+++ b/poetry.toml
@@ -1,2 +0,0 @@
-[virtualenvs]
-in-project = true
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,15 +1,21 @@
+[build-system]
+requires = [ "poetry-core",]
+build-backend = "poetry.core.masonry.api"
+
 [tool.poetry]
 name = "python-jobspy"
-version = "1.1.73"
+version = "1.1.76"
 description = "Job scraper for LinkedIn, Indeed, Glassdoor & ZipRecruiter"
-authors = ["Zachary Hampton <zachary@bunsly.com>", "Cullen Watson <cullen@bunsly.com>"]
+authors = [ "Zachary Hampton <zachary@bunsly.com>", "Cullen Watson <cullen@bunsly.com>",]
 homepage = "https://github.com/Bunsly/JobSpy"
 readme = "README.md"
-keywords = ['jobs-scraper', 'linkedin', 'indeed', 'glassdoor', 'ziprecruiter']
+keywords = [ "jobs-scraper", "linkedin", "indeed", "glassdoor", "ziprecruiter",]
+[[tool.poetry.packages]]
+include = "jobspy"
+from = "src"

-packages = [
-    { include = "jobspy", from = "src" }
-]
+[tool.black]
+line-length = 88

 [tool.poetry.dependencies]
 python = "^3.10"
@@ -22,16 +28,8 @@ tls-client = "^1.0.1"
 markdownify = "^0.13.1"
 regex = "^2024.4.28"

-
 [tool.poetry.group.dev.dependencies]
 pytest = "^7.4.1"
 jupyter = "^1.0.0"
 black = "*"
 pre-commit = "*"
-
-[build-system]
-requires = ["poetry-core"]
-build-backend = "poetry.core.masonry.api"
-
-[tool.black]
-line-length = 88
--- a/src/jobspy/init.py
+++ b/src/jobspy/init.py
@@ -24,6 +24,7 @@ from .scrapers.exceptions import (
 def scrape_jobs(
    site_name: str | list[str] | Site | list[Site] | None = None,
    search_term: str | None = None,
+    google_search_term: str | None = None,
    location: str | None = None,
    distance: int | None = 50,
    is_remote: bool = False,
@@ -86,6 +87,7 @@ def scrape_jobs(
        site_type=get_site_type(),
        country=country_enum,
        search_term=search_term,
+        google_search_term=google_search_term,
        location=location,
        distance=distance,
        is_remote=is_remote,
@@ -216,8 +218,8 @@ def scrape_jobs(
            "title",
            "company",
            "location",
-            "job_type",
            "date_posted",
+            "job_type",
            "salary_source",
            "interval",
            "min_amount",
@@ -248,6 +250,8 @@ def scrape_jobs(
        jobs_df = jobs_df[desired_order]

        # Step 4: Sort the DataFrame as required
-        return jobs_df.sort_values(by=["site", "date_posted"], ascending=[True, False])
+        return jobs_df.sort_values(
+            by=["site", "date_posted"], ascending=[True, False]
+        ).reset_index(drop=True)
    else:
        return pd.DataFrame()
--- a/src/jobspy/scrapers/init.py
+++ b/src/jobspy/scrapers/init.py
@@ -28,6 +28,7 @@ class SalarySource(Enum):
 class ScraperInput(BaseModel):
    site_type: list[Site]
    search_term: str | None = None
+    google_search_term: str | None = None

    location: str | None = None
    country: Country | None = Country.USA
--- a/src/jobspy/scrapers/google/init.py
+++ b/src/jobspy/scrapers/google/init.py
@@ -2,7 +2,7 @@
 jobspy.scrapers.google
 ~~~~~~~~~~~~~~~~~~~

-This module contains routines to scrape Glassdoor.
+This module contains routines to scrape Google.
 """

 from __future__ import annotations
@@ -34,12 +34,11 @@ class GoogleJobsScraper(Scraper):
        self, proxies: list[str] | str | None = None, ca_cert: str | None = None
    ):
        """
-        Initializes GlassdoorScraper with the Glassdoor job search url
+        Initializes Google Scraper with the Google jobs search url
        """
        site = Site(Site.GOOGLE)
        super().__init__(site, proxies=proxies, ca_cert=ca_cert)

-        self.base_url = None
        self.country = None
        self.session = None
        self.scraper_input = None
@@ -50,24 +49,24 @@ class GoogleJobsScraper(Scraper):

    def scrape(self, scraper_input: ScraperInput) -> JobResponse:
        """
-        Scrapes Glassdoor for jobs with scraper_input criteria.
+        Scrapes Google for jobs with scraper_input criteria.
        :param scraper_input: Information about job search criteria.
        :return: JobResponse containing a list of jobs.
        """
        self.scraper_input = scraper_input
        self.scraper_input.results_wanted = min(900, scraper_input.results_wanted)
-        self.base_url = self.scraper_input.country.get_glassdoor_url()

        self.session = create_session(
            proxies=self.proxies, ca_cert=self.ca_cert, is_tls=False, has_retry=True
        )
-        forward_cursor = self._get_initial_cursor()
+        forward_cursor, job_list = self._get_initial_cursor_and_jobs()
        if forward_cursor is None:
-            logger.error("initial cursor not found")
-            return JobResponse(jobs=[])
+            logger.warning(
+                "initial cursor not found; try changing your query, or there were at most 10 results"
+            )
+            return JobResponse(jobs=job_list)

        page = 1
-        job_list: list[JobPost] = []

        while (
            len(self.seen_urls) < scraper_input.results_wanted + scraper_input.offset
@@ -76,7 +75,11 @@ class GoogleJobsScraper(Scraper):
            logger.info(
                f"search page: {page} / {math.ceil(scraper_input.results_wanted / self.jobs_per_page)}"
            )
-            jobs, forward_cursor = self._get_jobs_next_page(forward_cursor)
+            try:
+                jobs, forward_cursor = self._get_jobs_next_page(forward_cursor)
+            except Exception as e:
+                logger.error(f"failed to get jobs on page: {page}, {e}")
+                break
            if not jobs:
                logger.info(f"found no jobs on page: {page}")
                break
@@ -89,8 +92,8 @@ class GoogleJobsScraper(Scraper):
            ]
        )

-    def _get_initial_cursor(self):
-        """Gets initial cursor to paginate through job listings"""
+    def _get_initial_cursor_and_jobs(self) -> Tuple[str | None, list[JobPost]]:
+        """Gets initial cursor and jobs to paginate through job listings"""
        query = f"{self.scraper_input.search_term} jobs"

        def get_time_range(hours_old):
@@ -123,13 +126,22 @@ class GoogleJobsScraper(Scraper):
        if self.scraper_input.is_remote:
            query += " remote"

+        if self.scraper_input.google_search_term:
+            query = self.scraper_input.google_search_term
+
        params = {"q": query, "udm": "8"}
        response = self.session.get(self.url, headers=headers_initial, params=params)

        pattern_fc = r'<div jsname="Yust4d"[^>]+data-async-fc="([^"]+)"'
        match_fc = re.search(pattern_fc, response.text)
        data_async_fc = match_fc.group(1) if match_fc else None
-        return data_async_fc
+        jobs_raw = self._find_job_info_initial_page(response.text)
+        jobs = []
+        for job_raw in jobs_raw:
+            job_post = self._parse_job(job_raw)
+            if job_post:
+                jobs.append(job_post)
+        return data_async_fc, jobs

    def _get_jobs_next_page(self, forward_cursor: str) -> Tuple[list[JobPost], str]:
        params = {"fc": [forward_cursor], "fcv": ["3"], "async": [async_param]}
@@ -149,55 +161,55 @@ class GoogleJobsScraper(Scraper):
        match_fc = re.search(pattern_fc, job_data)
        data_async_fc = match_fc.group(1) if match_fc else None
        jobs_on_page = []
-
        for array in parsed:
-
            _, job_data = array
            if not job_data.startswith("[[["):
                continue
            job_d = json.loads(job_data)

            job_info = self._find_job_info(job_d)
-
-            job_url = job_info[3][0][0] if job_info[3] and job_info[3][0] else None
-            if job_url in self.seen_urls:
-                continue
-            self.seen_urls.add(job_url)
-
-            title = job_info[0]
-            company_name = job_info[1]
-            location = city = job_info[2]
-            state = country = date_posted = None
-            if location and "," in location:
-                city, state, *country = [*map(lambda x: x.strip(), location.split(","))]
-
-            days_ago_str = job_info[12]
-            if type(days_ago_str) == str:
-                match = re.search(r"\d+", days_ago_str)
-                days_ago = int(match.group()) if match else None
-                date_posted = (datetime.now() - timedelta(days=days_ago)).date()
-
-            description = job_info[19]
-
-            job_post = JobPost(
-                id=f"go-{job_info[28]}",
-                title=title,
-                company_name=company_name,
-                location=Location(
-                    city=city, state=state, country=country[0] if country else None
-                ),
-                job_url=job_url,
-                job_url_direct=job_url,
-                date_posted=date_posted,
-                is_remote="remote" in description.lower()
-                or "wfh" in description.lower(),
-                description=description,
-                emails=extract_emails_from_text(description),
-                job_type=extract_job_type(description),
-            )
-            jobs_on_page.append(job_post)
+            job_post = self._parse_job(job_info)
+            if job_post:
+                jobs_on_page.append(job_post)
        return jobs_on_page, data_async_fc

+    def _parse_job(self, job_info: list):
+        job_url = job_info[3][0][0] if job_info[3] and job_info[3][0] else None
+        if job_url in self.seen_urls:
+            return
+        self.seen_urls.add(job_url)
+
+        title = job_info[0]
+        company_name = job_info[1]
+        location = city = job_info[2]
+        state = country = date_posted = None
+        if location and "," in location:
+            city, state, *country = [*map(lambda x: x.strip(), location.split(","))]
+
+        days_ago_str = job_info[12]
+        if isinstance(days_ago_str, str):
+            match = re.search(r"\d+", days_ago_str)
+            if match:  # keep date_posted as None when no number is found
+                date_posted = (datetime.now() - timedelta(days=int(match[0]))).date()
+
+        description = job_info[19]
+
+        job_post = JobPost(
+            id=f"go-{job_info[28]}",
+            title=title,
+            company_name=company_name,
+            location=Location(
+                city=city, state=state, country=country[0] if country else None
+            ),
+            job_url=job_url,
+            date_posted=date_posted,
+            is_remote="remote" in description.lower() or "wfh" in description.lower(),
+            description=description,
+            emails=extract_emails_from_text(description),
+            job_type=extract_job_type(description),
+        )
+        return job_post
+
    @staticmethod
    def _find_job_info(jobs_data: list | dict) -> list | None:
        """Iterates through the JSON data to find the job listings"""
@@ -215,3 +227,24 @@ class GoogleJobsScraper(Scraper):
                if result:
                    return result
        return None
+
+    @staticmethod
+    def _find_job_info_initial_page(html_text: str):
+        pattern = (
+            f'520084652":('
+            + r"\[.*?\]\s*])\s*}\s*]\s*]\s*]\s*]\s*]"
+        )
+        results = []
+        matches = re.finditer(pattern, html_text)
+
+        import json
+
+        for match in matches:
+            try:
+                parsed_data = json.loads(match.group(1))
+                results.append(parsed_data)
+
+            except json.JSONDecodeError as e:
+                # Skip unparseable matches; _parse_job indexes a list, not an error dict
+                logger.error(f"Failed to parse match: {str(e)}")
+        return results
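Review note: the `days_ago_str` handling in `_parse_job` relies on the regex finding a number in the text. A minimal standalone sketch of that parsing step is shown here, returning `None` whenever the input is not a string or contains no number. The helper name `parse_days_ago` and its `today` parameter are hypothetical and are not part of this change.

```python
import re
from datetime import date, timedelta
from typing import Optional


def parse_days_ago(text: object, today: Optional[date] = None) -> Optional[date]:
    """Convert text such as '3 days ago' into a posting date.

    Hypothetical helper for illustration; returns None when no number is found.
    """
    if not isinstance(text, str):
        return None
    match = re.search(r"\d+", text)
    if match is None:
        return None
    # Allow the reference date to be injected so the function is testable
    reference = today if today is not None else date.today()
    return reference - timedelta(days=int(match.group()))


print(parse_days_ago("3 days ago", today=date(2024, 12, 4)))
```

Injecting `today` keeps the function deterministic in tests; production code would omit the argument.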
Author	SHA1	Message	Date
Cullen Watson	13c74a0fed	docs:readme	2025-02-09 13:42:18 -06:00
Cullen Watson	333e9e6760	docs:readme	2025-01-17 21:44:49 -06:00
github-actions	04032a0f91	Increment version	2024-12-04 22:55:06 +00:00
Cullen Watson	496896d0b5	enh:fix yml (#225)	2024-12-04 16:54:52 -06:00
Cullen Watson	87ba1ad1bf	fix yml	2024-12-04 16:52:15 -06:00
Jason Geffner	4e7ac9a583	Fix Google job search (#223 ) The previous regex did not capture all expected matches in the returned content	2024-12-04 16:45:59 -06:00
Cullen Watson	e44d13e1cf	enh:auto update version	2024-12-04 16:29:38 -06:00
Cullen Watson	d52e366ef7	docs:readme	2024-11-26 15:51:26 -06:00
Cullen Watson	395ebf0017	docs:readme	2024-11-26 15:49:12 -06:00
Cullen Watson	63fddd9b7f	docs:readme	2024-11-26 15:48:22 -06:00
Cullen Watson	58956868ae	docs:readme	2024-11-26 15:47:10 -06:00
Cullen Watson	4fce836222	docs:readme	2024-10-28 03:53:59 -05:00
Cullen Watson	5ba25e7a7c	docs:readme	2024-10-28 03:42:19 -05:00
Cullen Watson	f7cb3e9206	docs:readme	2024-10-28 03:36:21 -05:00
Cullen Watson	3ad3f121f7	docs:readme	2024-10-28 03:34:52 -05:00
Cullen Watson	ff3c782912	docs:readme	2024-10-25 18:12:08 -05:00
Cullen Watson	338d854b96	fix(google): search (#216)	2024-10-25 14:54:14 -05:00
Cullen Watson	811d4c40b4	chore:version	2024-10-24 15:28:25 -05:00
Cullen Watson	dba92d22c2	chore:version	2024-10-24 15:27:16 -05:00
Cullen Watson	10a3592a0f	docs:file	2024-10-24 15:26:49 -05:00
Cullen Watson	b7905cc756	docs:file	2024-10-24 15:24:18 -05:00
Cullen Watson	6867d58829	docs:readme	2024-10-24 15:22:31 -05:00