🏠 PropFlux

PropFlux is a scalable real estate data extraction engine designed for resilient, multi-site scraping. Built with Scrapy, with Selenium + NopeCHA support for dynamic fields, and an admin dashboard that makes long-running jobs observable and controllable.

Why clients pick this (quick pitch)

You get a repeatable scraping pipeline: run jobs, monitor progress, and export clean results.
Anti-bot resilience is built in: proxies, retry-aware crawling, and browser stealth/CAPTCHA solving.
The project is operator-friendly: live logs, telemetry, termination controls, and search/exploration of results.

🎯 Features

Multi-site support: Architecture ready for scale, with high-fidelity support for Property24 and Private Property.
Robust scraping: Automatic pagination, retry logic, and error handling.
Data normalization: Standardizes prices, locations, and property details.
Deduplication: Removes duplicate listings based on ID or URL.
Multiple export formats: CSV, SQLite, and finalized JSON arrays.
Memory-Efficient: Periodic flushing to disk for large-scale scraping.
Stealth Infrastructure: Full proxy rotation and anti-bot bypassing (NopeCHA integration).
Dynamic Content Support: Selenium-based extraction for JavaScript-heavy elements (agent details, phone numbers).
Admin dashboard (monitoring + job control): React + FastAPI UI for telemetry, live logs, job termination, analytics, and data exploration.

📊 Current State

PropFlux is running as a complete scraping + data pipeline system with: incremental exports (CSV/JSON/SQLite), optional Selenium/NopeCHA dynamic extraction, and a FastAPI + React dashboard for monitoring, job lifecycle control, and analytics. Ongoing work focuses on adding new targets and improving dashboard coverage as more fields are standardized.

📋 Requirements

Python 3.11+
Chrome Browser (for Selenium-based extraction)
See requirements.txt for dependencies

🚀 Quick Start

1. Setup

# Clone or navigate to project directory
cd multi-site-real-estate-scraper

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Run the Scraper

# Scrape Property24 (default settings)
python runner.py --site property24

# Scrape Private Property
python runner.py --site privateproperty

# Hard limit total listings (useful for testing)
python runner.py --site property24 --limit 10

# Skip expensive Selenium dynamic extraction (for rapid testing of Scrapy logic)
python runner.py --site privateproperty --limit 5 --skip-dynamic-fields

# Use custom URL (search results or single listing)
python runner.py --site property24 --url "https://www.property24.com/for-sale/cape-town/western-cape/432"

# Verbose logging
python runner.py --site property24 --verbose

📞 Dynamic Data Extraction (Selenium + NopeCHA)

Some sites hide agent details (names, phone numbers) behind JavaScript or CAPTCHAs. This project uses the BrowserService with a persistent Chrome profile to auto-solve CAPTCHAs via the NopeCHA extension.

The chrome-profiles/ directory is gitignored — your API key and session data are never committed.

One-Time Setup (per developer)

Prerequisites

A NopeCHA API key — sign up at nopecha.com
Set it in your .env file: NOPECHA_API_KEY=your_key_here

Run the setup command:

python runner.py --setup-chrome-profile

Follow the on-screen instructions to install the extension and authenticate. After setup, all future runs will reuse this profile automatically.

Output (where results go)

CSV: output/<spider>_<timestamp>.csv
Finalized JSON: output/<spider>_<timestamp>.json
SQLite: output/listings.db
Job progress snapshots: output/job_stats/<job_id>.json
Logs: logs/<site>_<job_id>.log

Key listing fields

Required: title, price, location, bedrooms, bathrooms, property_type, listing_url, description
Metadata/flags: source_site, job_id, scraped_at, plus is_studio, is_auction, is_private_seller (when detectable)
Optional (depends on site selectors): agent_name, agent_phone, agency_name, listing_id, date_posted, erf_size, floor_size

Client-Facing Walkthrough (how this delivers results)

PropFlux is the kind of scraper I build when the goal is not just “get data once”, but to create a repeatable pipeline you can operate: run jobs, monitor progress, stop bad runs, and export clean outputs for analysis or downstream systems.

How it works

runner.py starts the scrape (Scrapy spider) and records a job in the SQLite-backed scrape_jobs table.
Spiders parse listing pages and emit raw items.
Pipelines normalize + deduplicate + export in batches (controlled by EXPORT_BATCH_SIZE).
For dynamic fields, BrowserService performs Selenium extraction using a persistent Chrome profile and NopeCHA (under MAX_CONCURRENT_BROWSERS, with RETRY_TIMES).
Scrapy updates lightweight progress snapshots in output/job_stats/<job_id>.json.
api/main.py exposes telemetry, logs, listings search, and job exports.
The dashboard provides an operator-friendly UI for starting/terminating jobs and exploring results.

How I would use this on your project

If you’re hiring me for web scraping + data pipelines + browser automation, I typically apply this same structure:

Start with site discovery and selector mapping (Scrapy + dynamic selectors).
Implement or extend a spider and pipelines so your output schema is consistent and validated.
Add resilience controls: retries, throttling, proxy strategy, and (when needed) Selenium extraction under a browser concurrency limit.
Ship monitoring: job lifecycle, telemetry, and log tailing, so you can safely run long scrapes without guessing.
Provide exports in the formats you need (CSV/JSON/SQLite) and optionally wire them to your downstream system.

Dashboard Documentation (what to click + what to expect)

The dashboard lives in dashboard/ and talks to the FastAPI backend.

Start the backend

python -m uvicorn api.main:app --reload --port 8000

Start the dashboard UI

cd dashboard
npm install
npm run dev

If your backend is not on localhost:8000, set: VITE_API_BASE_URL=http://<host>:<port>.

Main Control Panel

Target site: choose property24 or privateproperty
Start URL / Search query: optional override (falls back to site defaults)
Skip dynamic fields: when enabled, the scraper avoids Selenium dynamic extraction (faster; less complete)
Use engine settings (default OFF): controls whether settings_overrides are sent to the API when starting a job.
- When OFF: the job runs using the current defaults in scraper/settings.py and config/settings.py
- When ON: engine sliders apply to the next job you run
Run job: starts a background scrape and selects the new job in the UI
Terminate: stops an active job and updates job status/termination timestamps in the database
Live Console: streams the latest log lines for the selected job
Recent Jobs: quick selector + job status snapshots (job_id, timestamps, item counts)

Engine Settings tab (applies only when the main toggle is ON)

Concurrency / domain (CONCURRENT_REQUESTS_PER_DOMAIN)
Download delay (DOWNLOAD_DELAY)
Headless mode (HEADLESS)
Export batch size (EXPORT_BATCH_SIZE)
Max concurrent browsers (MAX_CONCURRENT_BROWSERS)
Retry times (RETRY_TIMES)

Analytics + Data Explorer

Analytics: charts for distribution and missing-field heatmaps (based on stored listings)
Data Explorer: searchable, paginated listing grid powered by /listings/query

Job History

Filter + pagination over /jobs/query
Per-job exports (CSV or prettified JSON) via /jobs/{job_id}/export

API endpoints (used by the dashboard)

GET / health check
POST /jobs/run start a job
POST /jobs/{job_id}/terminate stop a running job
GET /jobs/{job_id}/telemetry progress + runtime status
GET /jobs/{job_id}/logs?tail=<N> live log tail
GET /listings/query?limit=&offset=&site=&job_id=&q= search + paginate listings
GET /jobs/{job_id}/export?format=csv|json download results

Extending PropFlux to new websites

To add a new target site:

Add site configuration in config/sites.yaml (selectors, pagination strategy, dynamic selectors).
Create a spider in scraper/spiders/ that extends the base spider.
Register the spider in runner.py (SPIDER_MAP).

Built with ❤️ for reliable, scalable web scraping

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
api		api
assets		assets
config		config
core		core
dashboard		dashboard
scraper		scraper
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
Makefile		Makefile
README.md		README.md
SETUP.md		SETUP.md
requirements.txt		requirements.txt
runner.py		runner.py
scrapy.cfg		scrapy.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🏠 PropFlux

Why clients pick this (quick pitch)

🎯 Features

📊 Current State

📋 Requirements

🚀 Quick Start

1. Setup

2. Run the Scraper

📞 Dynamic Data Extraction (Selenium + NopeCHA)

One-Time Setup (per developer)

Output (where results go)

Key listing fields

Client-Facing Walkthrough (how this delivers results)

How it works

How I would use this on your project

Dashboard Documentation (what to click + what to expect)

Start the backend

Start the dashboard UI

Main Control Panel

Engine Settings tab (applies only when the main toggle is ON)

Analytics + Data Explorer

Job History

API endpoints (used by the dashboard)

Extending PropFlux to new websites

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🏠 PropFlux

Why clients pick this (quick pitch)

🎯 Features

📊 Current State

📋 Requirements

🚀 Quick Start

1. Setup

2. Run the Scraper

📞 Dynamic Data Extraction (Selenium + NopeCHA)

One-Time Setup (per developer)

Output (where results go)

Key listing fields

Client-Facing Walkthrough (how this delivers results)

How it works

How I would use this on your project

Dashboard Documentation (what to click + what to expect)

Start the backend

Start the dashboard UI

Main Control Panel

Engine Settings tab (applies only when the main toggle is ON)

Analytics + Data Explorer

Job History

API endpoints (used by the dashboard)

Extending PropFlux to new websites

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages