
SeekSpider

Smart Job Scraper for SEEK
A powerful, AI-augmented web scraping tool built with Scrapy, designed to extract, process, and analyze job listings from seek.com.au. SeekSpider enables real-time job market intelligence with tech stack trends, salary insights, and clean PostgreSQL integration.



📚 Overview

SeekSpider is a modular scraping system designed for job market analysis. It collects IT-related job postings from SEEK using Scrapy and Selenium, enriches the data with AI-powered salary and tech stack analysis, and stores everything into a PostgreSQL database with JSONB fields for flexibility and speed.


βš™οΈ Features

🕸 Data Collection

  • Scrapy crawler with category + pagination traversal
  • Selenium-based authentication
  • BeautifulSoup integration for fine-grained parsing
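As a sketch of the fine-grained parsing step, reducing a job ad's HTML to normalized plain text with BeautifulSoup might look like this (the function name is illustrative, not the project's actual API):

```python
from bs4 import BeautifulSoup

def clean_description(html: str) -> str:
    """Reduce a job-ad HTML fragment to normalized plain text.
    Illustrative only; the spider's real parsing is more involved."""
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    # Collapse runs of whitespace left behind by stripped tags.
    return " ".join(text.split())
```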

🧠 AI Integration

  • Extracts and analyzes technology stacks
  • Normalizes salary info
  • Generates demand statistics on tech usage
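For instance, the simple cases of salary normalization can be handled with a small regex pass; this is a simplified stand-in for the `SalaryNormalizer` module, which can defer harder free-text cases to the AI client:

```python
import re
from typing import Optional, Tuple

def normalize_salary(pay_range: str) -> Tuple[Optional[int], Optional[int]]:
    """Parse a free-text pay range like '$90k - $110k' into numeric
    bounds. A simplified sketch; the real module handles more formats."""
    values = []
    for m in re.finditer(r"\$?\s*(\d[\d,]*(?:\.\d+)?)\s*([kK])?", pay_range):
        value = float(m.group(1).replace(",", ""))
        if m.group(2):  # a trailing 'k' means thousands
            value *= 1000
        values.append(int(value))
    if not values:
        return None, None
    return min(values), max(values)
```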

💾 Database & Storage

  • PostgreSQL with JSONB for flexible schema
  • Transaction-safe pipeline with smart upserts
  • Automatic job status tracking
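A minimal sketch of the upsert, keyed on the SEEK job `Id` and using column names from the schema below (psycopg2-style placeholders; the column list is trimmed for illustration):

```python
# Transaction-safe upsert: insert a new job, or refresh an existing row.
UPSERT_SQL = """
INSERT INTO "Jobs" ("Id", "JobTitle", "BusinessName", "Url", "IsActive")
VALUES (%(Id)s, %(JobTitle)s, %(BusinessName)s, %(Url)s, TRUE)
ON CONFLICT ("Id") DO UPDATE SET
    "JobTitle"     = EXCLUDED."JobTitle",
    "BusinessName" = EXCLUDED."BusinessName",
    "Url"          = EXCLUDED."Url",
    "IsActive"     = TRUE;
"""

def upsert_job(cursor, item: dict) -> None:
    """Run the upsert on an open psycopg2 cursor; the caller commits."""
    cursor.execute(UPSERT_SQL, item)
```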

🧰 Architecture

  • Modular class structure (DatabaseManager, AIClient, Logger, Utils)
  • Environment-configured settings
  • Batch-safe crawling and retry mechanisms

🚀 Getting Started

Prerequisites

  • Python 3.9+
  • PostgreSQL (with an active database)
  • Google Chrome + ChromeDriver
  • Git

Installation

git clone https://github.com/qinscode/SeekSpider.git
cd SeekSpider
pip install -r requirements.txt

Configuration

Create a .env file in the root directory:

POSTGRESQL_HOST=localhost
POSTGRESQL_PORT=5432
POSTGRESQL_USER=postgres
POSTGRESQL_PASSWORD=secret
POSTGRESQL_DATABASE=seek_data
POSTGRESQL_TABLE=Jobs

SEEK_USERNAME=your_email
SEEK_PASSWORD=your_password

AI_API_KEY=your_api_key
AI_API_URL=https://api.openai.com/v1/...
AI_MODEL=gpt-4

Make sure PostgreSQL is running and your credentials are correct.
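These settings can be validated at startup so the spider fails fast on a bad setup. A minimal sketch of what the `Config` module might do, assuming the `.env` values have been exported into the environment (the required-key list here is a subset, and `load_config` is an illustrative name):

```python
import os

# Subset of the settings above, for illustration.
REQUIRED = (
    "POSTGRESQL_HOST", "POSTGRESQL_USER", "POSTGRESQL_PASSWORD",
    "POSTGRESQL_DATABASE", "SEEK_USERNAME", "SEEK_PASSWORD", "AI_API_KEY",
)

def load_config() -> dict:
    """Read settings from the environment, failing fast on any gaps."""
    missing = [key for key in REQUIRED if not os.getenv(key)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return {key: os.environ[key] for key in REQUIRED}
```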


πŸƒ Run the Spider

Option 1: With main script

python main.py

Option 2: With Scrapy

scrapy crawl seek

This will log in to SEEK, collect job data, and store it in PostgreSQL.


πŸ” API Query Parameters

The spider uses SEEK's internal search API. Here's an example:

search_params = {
    'where': 'All Perth WA',
    'classification': '6281',  # IT category
    'seekSelectAllPages': 'true',
    'locale': 'en-AU',
}
  • Supports subclassification traversal
  • Automatically paginated
  • SEO metadata enabled
  • Auth tokens handled automatically
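Pagination can then be driven by copying these params for each results page; note the `'page'` key name is an assumption about the internal API, not something documented:

```python
# Base query used by the spider (from the example above).
search_params = {
    'where': 'All Perth WA',
    'classification': '6281',  # IT category
    'seekSelectAllPages': 'true',
    'locale': 'en-AU',
}

def page_params(base: dict, page: int) -> dict:
    """Return a per-page copy of the search params without mutating `base`."""
    params = dict(base)
    params['page'] = str(page)  # assumed pagination key
    return params
```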

🧱 Project Structure

SeekSpider/
β”œβ”€β”€ spiders/seek_spider.py      # Main spider
β”œβ”€β”€ pipelines.py                # Data insertion logic
β”œβ”€β”€ items.py                    # Data model
β”œβ”€β”€ settings.py                 # Scrapy settings
β”œβ”€β”€ main.py                     # Entry point
β”œβ”€β”€ db/                         # Database utilities
β”œβ”€β”€ ai/                         # AI analysis components
└── utils/                      # Parsing, token, salary analyzers

🧩 Key Modules

  • DatabaseManager: Context-managed PostgreSQL operations with retries
  • Logger: Colored logging with levels + per-component logs
  • AIClient: Handles external API requests and formatting
  • TechStackAnalyzer: NLP-based tech term extraction
  • SalaryNormalizer: Converts pay ranges to numeric bounds
  • Config: Loads and validates .env settings
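The retry behaviour described for `DatabaseManager` might be sketched as a simple wrapper; the attempt count and delay here are assumptions, not the project's actual defaults:

```python
import time

def with_retries(fn, attempts: int = 3, delay: float = 0.5):
    """Wrap a database operation so transient failures are retried
    before the last exception is re-raised."""
    def wrapper(*args, **kwargs):
        last_exc = None
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception as exc:  # the real code would narrow this
                last_exc = exc
                if attempt < attempts - 1:
                    time.sleep(delay)
        raise last_exc
    return wrapper
```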

🗃 Database Schema

CREATE TABLE "Jobs"
(
    "Id"             INTEGER PRIMARY KEY,
    "JobTitle"       VARCHAR,
    "BusinessName"   VARCHAR,
    "WorkType"       VARCHAR,
    "JobDescription" TEXT,
    "PayRange"       VARCHAR,
    "Suburb"         VARCHAR,
    "Area"           VARCHAR,
    "Url"            VARCHAR,
    "AdvertiserId"   INTEGER,
    "JobType"        VARCHAR,
    "PostedDate"     TIMESTAMP,
    "ExpiryDate"     TIMESTAMP,
    "IsActive"       BOOLEAN   DEFAULT TRUE,
    "TechStack"      JSONB,
    "MinSalary"      INTEGER,
    "MaxSalary"      INTEGER,
    "CreatedAt"      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Recommended indexes:

CREATE INDEX idx_active ON "Jobs" ("IsActive");
CREATE INDEX idx_salary ON "Jobs" ("MinSalary", "MaxSalary");
CREATE INDEX idx_techstack ON "Jobs" USING GIN ("TechStack");

🤝 Contributing

Pull requests are welcome!
Please open an issue to discuss major changes.

git checkout -b feature/my-new-feature
git commit -m "feat: add new parser"
git push origin feature/my-new-feature

📄 License

Licensed under the Apache License 2.0.


πŸ™ Acknowledgments

About

Seekspider: A Scrapy Project for Job Scraping

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •