Jobs scraper library for LinkedIn, Indeed, Glassdoor & ZipRecruiter

JobSpy AIO Scraper

Features

  • Scrapes job postings from LinkedIn, Indeed & ZipRecruiter simultaneously
  • Returns jobs as JSON or CSV with title, location, company, description & other data
  • Imports directly into Google Sheets
  • Optional JWT authorization

API

POST /api/v1/jobs/

Request Schema

Required
├── site_type (List[enum]): linkedin, zip_recruiter, indeed
└── search_term (str)
Optional
├── location (str)
├── distance (int)
├── job_type (enum): fulltime, parttime, internship, contract
├── is_remote (bool)
├── results_wanted (int): per site_type
├── easy_apply (bool): only for linkedin
└── output_format (enum): json, csv, gsheet

Request Example

{
    "site_type": ["indeed", "linkedin"],
    "search_term": "software engineer",
    "location": "austin, tx",
    "distance": 10,
    "job_type": "fulltime",
    "results_wanted": 15,
    "output_format": "gsheet"
}

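The request above can also be sent from Python using only the standard library. This is a minimal sketch, assuming the server runs locally on the default port 8000, accepts the schema as a JSON body, and has auth disabled; `API_URL`, `PAYLOAD`, `build_request` and `fetch_jobs` are illustrative names, not part of JobSpy. The example asks for `json` output instead of `gsheet` so the jobs come back in the response.

```python
import json
import urllib.request

# Assumption: the server is running locally on the default port.
API_URL = "http://localhost:8000/api/v1/jobs/"

PAYLOAD = {
    "site_type": ["indeed", "linkedin"],
    "search_term": "software engineer",
    "location": "austin, tx",
    "distance": 10,
    "job_type": "fulltime",
    "results_wanted": 15,
    "output_format": "json",
}


def build_request(payload: dict, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request that carries the payload as a JSON body."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_jobs(payload: dict, url: str = API_URL) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(payload, url)) as resp:
        return json.load(resp)
```

With the server running, `fetch_jobs(PAYLOAD)` should return a dictionary keyed by site.
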
Response Schema

The response is keyed by site_type (enum); each key holds a JobResponse:
JobResponse
├── success (bool)
├── error (str)
├── jobs (List[JobPost])
│   └── JobPost
│       ├── title (str)
│       ├── company_name (str)
│       ├── job_url (str)
│       ├── location (object)
│       │   ├── country (str)
│       │   ├── city (str)
│       │   └── state (str)
│       ├── description (str)
│       ├── job_type (enum)
│       ├── compensation (object)
│       │   ├── interval (CompensationInterval): yearly, monthly, weekly, daily, hourly
│       │   ├── min_amount (float)
│       │   ├── max_amount (float)
│       │   └── currency (str)
│       └── date_posted (datetime)
│
├── total_results (int)
└── returned_results (int) 
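
On the client side, the JobPost part of this schema could be mirrored with dataclasses. The sketch below is an illustration, not JobSpy's own model code: the class names follow the schema, but which fields are optional is an assumption based on the null values in the JSON example, and `parse_job` is a hypothetical helper.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Location:
    country: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None


@dataclass
class Compensation:
    interval: str
    min_amount: Optional[float] = None
    max_amount: Optional[float] = None
    currency: Optional[str] = None


@dataclass
class JobPost:
    title: str
    company_name: str
    job_url: str
    location: Location
    description: Optional[str] = None
    job_type: Optional[str] = None
    compensation: Optional[Compensation] = None
    date_posted: Optional[datetime] = None


def parse_job(raw: dict) -> JobPost:
    """Convert one decoded JSON job object into a JobPost."""
    comp = raw.get("compensation")
    posted = raw.get("date_posted")
    return JobPost(
        title=raw["title"],
        company_name=raw["company_name"],
        job_url=raw["job_url"],
        location=Location(**(raw.get("location") or {})),
        description=raw.get("description"),
        job_type=raw.get("job_type"),
        # compensation and date_posted may be null in the response
        compensation=Compensation(**comp) if comp else None,
        date_posted=datetime.fromisoformat(posted) if posted else None,
    )
```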

Response Example (GOOGLE SHEETS)

{
    "status": "Successfully uploaded to Google Sheets",
    "error": null,
    "linkedin": null,
    "indeed": null,
    "zip_recruiter": null
}

Response Example (JSON)

{
    "indeed": {
        "success": true,
        "error": null,
        "jobs": [
            {
                "title": "Software Engineer",
                "company_name": "INTEL",
                "job_url": "https://www.indeed.com/jobs/viewjob?jk=a2cfbb98d2002228",
                "location": {
                    "country": "USA",
                    "city": "Austin",
                    "state": "TX"
                },
                "description": "Job Description Designs, develops, tests, and debugs...",
                "job_type": "fulltime",
                "compensation": {
                    "interval": "yearly",
                    "min_amount": 209760.0,
                    "max_amount": 139480.0,
                    "currency": "USD"
                },
                "date_posted": "2023-08-18T00:00:00"
            }, ...
        ],
        "total_results": 845,
        "returned_results": 15
    },
    "linkedin": {
        "success": true,
        "error": null,
        "jobs": [
            {
                "title": "Software Engineer 1",
                "company_name": "Public Partnerships | PPL",
                "job_url": "https://www.linkedin.com/jobs/view/3690013792",
                "location": {
                    "country": "USA",
                    "city": "Austin",
                    "state": "TX"
                },
                "description": "Public Partnerships LLC supports individuals with disabilities...",
                "job_type": null,
                "compensation": null,
                "date_posted": "2023-07-31T00:00:00"
            }, ...
        ],
        "total_results": 2000,
        "returned_results": 15
    }
}
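
A response shaped like the one above can be walked site by site. The following sketch assumes the response has already been decoded into a dictionary and that a site that was not requested appears as null or is absent; `summarize` is a hypothetical helper name.

```python
def summarize(response: dict) -> list[str]:
    """Return one summary line per site, followed by one line per job."""
    lines = []
    for site, result in response.items():
        if result is None:
            continue  # assumption: null means the site was not requested
        if not result["success"]:
            lines.append(f"{site}: failed ({result['error']})")
            continue
        lines.append(
            f"{site}: {result['returned_results']} of {result['total_results']} results"
        )
        for job in result["jobs"]:
            lines.append(f"  {job['title']} at {job['company_name']}")
    return lines
```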

Response Example (CSV)

Site, Title, Company Name, Job URL, Country, City, State, Job Type, Compensation Interval, Min Amount, Max Amount, Currency, Date Posted, Description
indeed, Software Engineer, INTEL, https://www.indeed.com/jobs/viewjob?jk=a2cfbb98d2002228, USA, Austin, TX, fulltime, yearly, 209760.0, 139480.0, USD, 2023-08-18T00:00:00, Job Description Designs...
linkedin, Software Engineer 1, Public Partnerships | PPL, https://www.linkedin.com/jobs/view/3690013792, USA, Austin, TX, , , , , , 2023-07-31T00:00:00, Public Partnerships LLC supports...
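
CSV output in this layout can be loaded with Python's `csv` module. This is a sketch under the assumption that the CSV arrives as plain text and that fields are separated by a comma followed by a space, as in the example; `read_jobs_csv` is an illustrative name.

```python
import csv
import io


def read_jobs_csv(text: str) -> list[dict]:
    """Parse CSV text into one dictionary per job, keyed by the header row."""
    # skipinitialspace drops the space that follows each comma in the example
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]
```

Descriptions that contain commas only parse correctly if the server quotes those fields, which the example above does not confirm.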

Installation

Docker Image (simple)

Requires Docker Desktop

Our Docker image is continuously updated and available on GitHub Container Registry. You can pull and use the image with:

docker pull ghcr.io/cullenwatson/jobspy:latest

Docker usage

Default params

By default, the Docker image:

  • expects client_secret.json in the directory you run the command from (only needed for Google Sheets; see below for how to obtain it)
  • listens on port 8000
  • puts the jobs into a sheet named JobSpy

To run the image with these default settings, use:

docker run -v client_secret.json:/app/client_secret.json -p 8000:8000 ghcr.io/cullenwatson/jobspy

Using custom params (port, path & sheet name)

For a custom port, path and sheet name, for example:

  • port 8030
  • path C:\config\client_secret.json
  • Google sheet name "JobSheet"

To run the image with these custom settings, use:

docker run -v C:\config\client_secret.json:/app/client_secret.json -e GSHEET_NAME=JobSheet -e PORT=8030 -p 8030:8030 ghcr.io/cullenwatson/jobspy
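
The -e flags in the command above set environment variables inside the container. As a rough illustration only, this is how a Python settings module might read such variables with the documented defaults (port 8000, sheet JobSpy). The actual settings.py may do this differently, and `load_settings` is a hypothetical name.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read PORT and GSHEET_NAME, falling back to the documented defaults."""
    return {
        "port": int(env.get("PORT", "8000")),
        "gsheet_name": env.get("GSHEET_NAME", "JobSpy"),
    }
```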

Usage

Google Sheets Integration (you need a client_secret.json)

Obtaining an Access Key: Video Guide

  • Enable the Google Sheets & Google Drive API
  • Create credentials -> service account -> create & continue
  • Select role -> basic: editor -> done
  • Click on the email you just created in the service account list
  • Go to the Keys tab -> add key -> create new key -> JSON -> Create

Using the key in the repo

  • Copy the key file into the JobSpy repo as /client_secret.json
  • Open the template sheet and save a copy to your account
  • Share the Google sheet, with editor rights, with the address in the client_email field of client_secret.json
  • If you changed the name of the sheet, put the name in GSHEET_NAME in /settings.py
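
To find the address to share the sheet with, the client_email field can be read straight from the key file. This is a small sketch that assumes the key file is in the current directory; `service_account_email` is an illustrative helper, not part of the repo.

```python
import json
from pathlib import Path


def service_account_email(key_path: str = "client_secret.json") -> str:
    """Return the client_email value stored in the service account key file."""
    data = json.loads(Path(key_path).read_text(encoding="utf-8"))
    return data["client_email"]
```

Running `print(service_account_email())` from the repo root prints the address to add as an editor on the sheet.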

Python installation (alternative to Docker)

Python version >= 3.10 required

  1. Clone this repository with git clone https://github.com/cullenwatson/jobspy
  2. Install the dependencies with pip install -r requirements.txt
  3. Run the server with uvicorn main:app --reload

How to call the API

Postman (preferred):

To use Postman:

  1. Locate the files in the /postman/ directory.
  2. Import the Postman collection and environment JSON files.

Swagger UI:

Alternatively, call the API from the interactive documentation at localhost:8000/docs.

FAQ

I'm having issues with my queries. What should I do?

Try reducing the number of results_wanted and/or broadening the filters. If issues still persist, feel free to submit an issue.

I'm getting response code 429. What should I do?

You have been blocked by the job board site for sending too many requests. Wait a couple seconds or use a VPN.
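
Waiting before retrying can be automated with exponential backoff. The sketch below is one possible approach, not something JobSpy provides: `call_with_backoff` is a hypothetical helper, and `send` stands for any function of yours that performs the request and returns a (status code, body) pair.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a status other than 429 or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # Wait 2s, 4s, 8s, ... (with the default base_delay) before retrying
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = send()
    return status, body
```

If every attempt returns 429, the last response is returned unchanged so the caller can decide what to do next.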

How to enable auth?

Change AUTH_REQUIRED in /settings.py to True

Authentication uses Supabase. Create a project with a users table and disable RLS (row-level security).

Add these three environment variables:

  • SUPABASE_URL: go to project settings -> API -> Project URL
  • SUPABASE_KEY: go to project settings -> API -> service_role secret
  • JWT_SECRET_KEY: run openssl rand -hex 32 in a terminal to create a 32-byte secret key
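
If openssl is not available, an equivalent 32-byte hex secret can be generated with Python's standard `secrets` module. `generate_jwt_secret` is just an illustrative wrapper name.

```python
import secrets


def generate_jwt_secret(num_bytes: int = 32) -> str:
    """Return num_bytes of cryptographically secure randomness as a hex string."""
    return secrets.token_hex(num_bytes)
```

`print(generate_jwt_secret())` prints a 64-character hex string that can be used as JWT_SECRET_KEY.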

Use these endpoints to register and get an access token:

[Image: endpoints for registering and obtaining an access token]