Smart Job Scraper for SEEK
An AI-augmented web scraping tool built with Scrapy that extracts, processes, and analyzes job listings
from seek.com.au. SeekSpider turns raw listings into job market intelligence: tech stack
trends, salary insights, and clean PostgreSQL integration.
SeekSpider is a modular scraping system designed for job market analysis. It collects IT-related job postings from SEEK using Scrapy and Selenium, enriches the data with AI-powered salary and tech stack analysis, and stores everything in a PostgreSQL database with JSONB fields for flexibility and speed.
- Scrapy crawler with category + pagination traversal
- Selenium-based authentication
- BeautifulSoup integration for fine-grained parsing
- Extracts and analyzes technology stacks
- Normalizes salary info
- Generates demand statistics on tech usage
- PostgreSQL with JSONB for flexible schema
- Transaction-safe pipeline with smart upserts (see the sketch after this list)
- Automatic job status tracking
- Modular class structure (`DatabaseManager`, `AIClient`, `Logger`, `Utils`)
- Environment-configured settings
- Batch-safe crawling and retry mechanisms
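A minimal sketch of what the smart upsert step could look like, assuming psycopg2 and the `Jobs` schema shown later; the real `pipelines.py` may structure this differently:

```python
import psycopg2

# Hypothetical upsert: insert new jobs, refresh existing ones by "Id".
UPSERT_SQL = """
    INSERT INTO "Jobs" ("Id", "JobTitle", "BusinessName", "Url", "IsActive")
    VALUES (%(Id)s, %(JobTitle)s, %(BusinessName)s, %(Url)s, TRUE)
    ON CONFLICT ("Id") DO UPDATE
        SET "JobTitle"     = EXCLUDED."JobTitle",
            "BusinessName" = EXCLUDED."BusinessName",
            "Url"          = EXCLUDED."Url",
            "IsActive"     = TRUE;
"""

def upsert_job(conn, item):
    # One transaction per item: commit on success, roll back on failure.
    try:
        with conn.cursor() as cur:
            cur.execute(UPSERT_SQL, dict(item))
        conn.commit()
    except psycopg2.Error:
        conn.rollback()
        raise
```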
- Python 3.9+
- PostgreSQL (with an active database)
- Google Chrome + ChromeDriver
- Git
```bash
git clone https://github.com/your-username/SeekSpider.git
cd SeekSpider
pip install -r requirements.txt
```
Create a `.env` file in the root directory:
```ini
POSTGRESQL_HOST=localhost
POSTGRESQL_PORT=5432
POSTGRESQL_USER=postgres
POSTGRESQL_PASSWORD=secret
POSTGRESQL_DATABASE=seek_data
POSTGRESQL_TABLE=Jobs
SEEK_USERNAME=your_email
SEEK_PASSWORD=your_password
AI_API_KEY=your_api_key
AI_API_URL=https://api.openai.com/v1/...
AI_MODEL=gpt-4
```
Make sure PostgreSQL is running and your credentials are correct.
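The `Config` component described below loads and validates these values at startup; a minimal sketch of that step, assuming python-dotenv (the actual implementation may differ):

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

REQUIRED_KEYS = [
    "POSTGRESQL_HOST", "POSTGRESQL_PORT", "POSTGRESQL_USER",
    "POSTGRESQL_PASSWORD", "POSTGRESQL_DATABASE", "POSTGRESQL_TABLE",
    "SEEK_USERNAME", "SEEK_PASSWORD", "AI_API_KEY", "AI_API_URL", "AI_MODEL",
]

def load_config():
    # Read .env from the project root into the process environment.
    load_dotenv()
    missing = [key for key in REQUIRED_KEYS if not os.getenv(key)]
    if missing:
        raise RuntimeError(f"Missing .env settings: {', '.join(missing)}")
    return {key: os.getenv(key) for key in REQUIRED_KEYS}
```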
```bash
python main.py       # entry point for the full run
scrapy crawl seek    # or run the spider directly
```
This will log in to SEEK, collect job data, and store it in PostgreSQL.
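The login step is handled by Selenium; here is a hedged sketch of that flow. The login URL and element selectors are placeholders, not SEEK's actual markup:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def seek_login(username, password):
    # Hypothetical flow: SEEK's real login page, field names, and
    # redirects may differ, and the spider may pull an auth token
    # from cookies or local storage after this step.
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.seek.com.au/oauth/login/")  # placeholder URL
        driver.find_element(By.ID, "emailAddress").send_keys(username)
        driver.find_element(By.ID, "password").send_keys(password)
        driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
        return driver.get_cookies()
    finally:
        driver.quit()
```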
The spider uses SEEK's internal search API. Here's an example:
```python
search_params = {
    'where': 'All Perth WA',
    'classification': '6281',  # IT category
    'seekSelectAllPages': 'true',
    'locale': 'en-AU',
}
```
- Supports subclassification traversal
- Automatically paginated (see the request sketch below)
- SEO metadata enabled
- Auth tokens handled automatically
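A sketch of how these parameters could be turned into a paginated request inside the spider; the endpoint path and header handling here are assumptions, not SEEK's documented API:

```python
from urllib.parse import urlencode
import scrapy

# Hypothetical endpoint; the spider's real URL may differ.
SEARCH_API = "https://www.seek.com.au/api/chalice-search/v4/search"

def build_search_request(search_params, page, auth_token, callback):
    # Merge the base parameters with the current page number.
    params = dict(search_params, page=page)
    return scrapy.Request(
        url=f"{SEARCH_API}?{urlencode(params)}",
        headers={"Authorization": f"Bearer {auth_token}"},
        callback=callback,
    )
```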
```
SeekSpider/
├── spiders/seek_spider.py   # Main spider
├── pipelines.py             # Data insertion logic
├── items.py                 # Data model
├── settings.py              # Scrapy settings
├── main.py                  # Entry point
├── db/                      # Database utilities
├── ai/                      # AI analysis components
└── utils/                   # Parsing, token, salary analyzers
```
- `DatabaseManager`: Context-managed PostgreSQL operations with retries
- `Logger`: Colored logging with levels + per-component logs
- `AIClient`: Handles external API requests and formatting
- `TechStackAnalyzer`: NLP-based tech term extraction
- `SalaryNormalizer`: Converts pay ranges to numeric bounds (see the sketch after this list)
- `Config`: Loads and validates `.env` settings
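As referenced above, a minimal sketch of the kind of bounds extraction `SalaryNormalizer` performs. The regex is illustrative only; the project describes salary analysis as AI-powered, so a pattern like this would cover just the easy cases:

```python
import re

def normalize_pay_range(pay_range):
    # Illustrative: pull numeric bounds out of strings like
    # "$90,000 - $110,000 per annum". Hourly/daily rates and messy
    # free text would need extra rules (or the AI client).
    numbers = [
        int(n.replace(",", ""))
        for n in re.findall(r"\$?(\d[\d,]*)", pay_range or "")
    ]
    if not numbers:
        return None, None
    return min(numbers), max(numbers)
```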
```sql
CREATE TABLE "Jobs"
(
    "Id"             INTEGER PRIMARY KEY,
    "JobTitle"       VARCHAR,
    "BusinessName"   VARCHAR,
    "WorkType"       VARCHAR,
    "JobDescription" TEXT,
    "PayRange"       VARCHAR,
    "Suburb"         VARCHAR,
    "Area"           VARCHAR,
    "Url"            VARCHAR,
    "AdvertiserId"   INTEGER,
    "JobType"        VARCHAR,
    "PostedDate"     TIMESTAMP,
    "ExpiryDate"     TIMESTAMP,
    "IsActive"       BOOLEAN DEFAULT TRUE,
    "TechStack"      JSONB,
    "MinSalary"      INTEGER,
    "MaxSalary"      INTEGER,
    "CreatedAt"      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
Recommended indexes:
```sql
CREATE INDEX idx_active ON "Jobs" ("IsActive");
CREATE INDEX idx_salary ON "Jobs" ("MinSalary", "MaxSalary");
CREATE INDEX idx_techstack ON "Jobs" USING GIN ("TechStack");
```
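With `TechStack` stored as JSONB, the tech-demand statistics reduce to a single aggregation. A sketch, assuming `TechStack` holds a JSON array of technology names:

```python
import psycopg2

# Assumes "TechStack" holds a JSON array like ["python", "aws", "docker"].
DEMAND_SQL = """
    SELECT tech, COUNT(*) AS openings
    FROM "Jobs", jsonb_array_elements_text("TechStack") AS tech
    WHERE "IsActive"
    GROUP BY tech
    ORDER BY openings DESC
    LIMIT 20;
"""

def top_technologies(conn):
    with conn.cursor() as cur:
        cur.execute(DEMAND_SQL)
        return cur.fetchall()  # [(tech, openings), ...]
```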
Pull requests are welcome!
Please open an issue to discuss major changes.
```bash
git checkout -b feature/my-new-feature
git commit -m "feat: add new parser"
git push origin feature/my-new-feature
```
Licensed under the Apache License 2.0.
- Scrapy for the powerful crawling engine
- Selenium for seamless login automation
- BeautifulSoup for DOM parsing