
🏛 RTO Big Data Web Scraper & AI-Powered Crawler

An automated data extraction pipeline that scrapes, cleans, and validates Registered Training Organisation (RTO) data from training.gov.au.

This project combines traditional scraping (Selenium, BeautifulSoup, Playwright) with LLM-powered crawling for JavaScript-heavy content, producing large-scale, up-to-date, structured datasets of educational courses.


✨ Features

  • 📊 Processes 4,000+ RTOs and 15,000+ courses in a single run
  • 🔄 Automated pagination & deep crawling from gov pages to official RTO sites
  • 🤖 LLM-assisted data extraction with:
    • Gemini 2.5 Flash – cost-efficient (<90k tokens/op)
    • DeepSeek R1 – semantic verification & keyword matching
  • 🖥 Hybrid HTML/Markdown/JSON parsing for dynamic content
  • 🗂 CSV & JSON output for backend/API pipelines
  • 🧹 Data cleaning & normalization with pandas
  • ⚡ Headless browser mode for faster execution (see the Playwright sketch below)
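
Headless execution is what makes batches of this size practical: the browser renders JavaScript without painting a window. A minimal sketch with Playwright's sync API (the URL below is a placeholder, not the project's real entry point):

```python
# Minimal headless-fetch sketch using Playwright's sync API.
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    """Return the fully rendered HTML of a JavaScript-heavy page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # no visible window
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")    # let JS requests settle
        html = page.content()
        browser.close()
    return html

if __name__ == "__main__":
    print(fetch_rendered_html("https://training.gov.au/")[:500])
```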

🛠 Technologies Used

Scraping & Automation

  • Python 3.11+
  • Selenium – DOM scraping & interactions
  • BeautifulSoup4 – HTML parsing
  • Playwright – JavaScript-rendered content scraping
  • Crawl4AI – AI-guided crawling & sub-URL targeting
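
Crawl4AI drives a rendered crawl and hands back LLM-friendly Markdown. A minimal sketch of that flow; option names and return types vary across crawl4ai versions, so treat this as the basic shape only:

```python
# Sketch: AI-friendly crawl with Crawl4AI's async crawler.
# Exact options vary by crawl4ai version; this shows only the basic flow.
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_to_markdown(url: str) -> str:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)  # renders the page, extracts content
        return str(result.markdown)           # Markdown suitable for LLM prompts

if __name__ == "__main__":
    print(asyncio.run(crawl_to_markdown("https://training.gov.au/")))
```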

AI Models

  • Gemini 2.5 Flash – Structured extraction from complex layouts
  • DeepSeek-R1 – Course existence & metadata verification
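
For the structured-extraction side, a sketch of calling Gemini through the google-generativeai client; the prompt wording, model name string, and environment variable are illustrative assumptions, not the project's actual configuration:

```python
# Sketch: structured extraction with Gemini via the google-generativeai client.
# Prompt, model name string, and env var are illustrative assumptions.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-flash")

def extract_courses(page_markdown: str) -> str:
    prompt = (
        "From the page content below, list every course code and title "
        "as JSON objects with keys 'code' and 'title'.\n\n" + page_markdown
    )
    response = model.generate_content(prompt)
    return response.text  # expected to contain the JSON payload
```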

Data Processing & Export

  • pandas – Data cleaning & transformation
  • CSV – Government schema-compatible export
  • JSON – API-ready format
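
A minimal sketch of the cleaning-and-export step with pandas; the column names are assumptions taken from the workflow below:

```python
# Sketch: normalize scraped rows with pandas and export both output formats.
# Column names are assumptions based on the workflow described below.
import pandas as pd

df = pd.read_csv("data/raw_rtos.csv")

df["Code"] = df["Code"].astype(str).str.strip()
df["Web Address"] = df["Web Address"].str.lower().str.strip()
df = df.drop_duplicates(subset=["Code"])

df.to_csv("data/final_rtos.csv", index=False)         # schema-compatible CSV
df.to_json("data/final_rtos.json", orient="records")  # API-ready JSON
```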

📂 Workflow

  1. Load Input CSV
     • Columns: Code, Web Address
  2. Phase 1 – Government Scraping
     For each code, scrape the following sub-pages (see the sketch after this list):
     • /summary – Organisation details
     • /contacts – Contact info
     • /addresses – Physical/postal addresses
     • /qualifications – Offered qualifications
  3. Phase 2 – AI Verification
     • Visit each RTO's official website
     • Search for each course using LLM keyword prompts
     • Flag discrepancies & missing courses
  4. Phase 3 – Cleaning & Structuring
     • Normalize dates, addresses, and contact info
     • Remove duplicates
     • Match to the CSV schema
  5. Output
     • Final CSV
     • Summary report of broken links & mismatches
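
As referenced in Phase 1, each RTO code fans out to four sub-pages. A minimal sketch of that loop with requests and BeautifulSoup; the base URL pattern is an assumption, not a documented training.gov.au route:

```python
# Sketch of the Phase 1 fan-out: one RTO code -> four sub-pages.
# BASE is an assumed URL pattern, not a documented training.gov.au route.
import requests
from bs4 import BeautifulSoup

BASE = "https://training.gov.au/organisation/details/{code}/{section}"  # assumption
SECTIONS = ["summary", "contacts", "addresses", "qualifications"]

def scrape_rto(code: str) -> dict[str, BeautifulSoup]:
    """Fetch and parse each sub-page for a single RTO code."""
    pages = {}
    for section in SECTIONS:
        resp = requests.get(BASE.format(code=code, section=section), timeout=30)
        resp.raise_for_status()
        pages[section] = BeautifulSoup(resp.text, "html.parser")
    return pages
```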

🔧 Getting Started

```bash
# 1. Clone the repository
git clone https://github.com/your-username/rto-big-data-scraper.git
cd rto-big-data-scraper

# 2. Create a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Place your input CSV at data/input.csv

# 5. Run the scraper
python scrape_rtos.py --input data/input.csv --output data/final_rtos.csv
```
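
A sketch of the command-line wiring that the final command implies; the real scrape_rtos.py entry point may differ:

```python
# Sketch of the CLI wiring implied by the run command above;
# the actual scrape_rtos.py entry point may differ.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="RTO scraping pipeline")
    parser.add_argument("--input", required=True, help="input CSV (Code, Web Address)")
    parser.add_argument("--output", required=True, help="destination for the final CSV")
    args = parser.parse_args()
    # run_pipeline is a hypothetical orchestrator for the three phases:
    # run_pipeline(args.input, args.output)

if __name__ == "__main__":
    main()
```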