PropFlux is a scalable real estate data extraction engine designed for resilient, multi-site scraping. It is built with Scrapy, uses Selenium + NopeCHA for dynamic fields, and ships with an admin dashboard that makes long-running jobs observable and controllable.
- You get a repeatable scraping pipeline: run jobs, monitor progress, and export clean results.
- Anti-bot resilience is built in: proxies, retry-aware crawling, and browser stealth/CAPTCHA solving.
- The project is operator-friendly: live logs, telemetry, termination controls, and search/exploration of results.
- Multi-site support: Architecture ready for scale, with high-fidelity support for Property24 and Private Property.
- Robust scraping: Automatic pagination, retry logic, and error handling.
- Data normalization: Standardizes prices, locations, and property details.
- Deduplication: Removes duplicate listings based on ID or URL.
- Multiple export formats: CSV, SQLite, and finalized JSON arrays.
- Memory-Efficient: Periodic flushing to disk for large-scale scraping.
- Stealth Infrastructure: Full proxy rotation and anti-bot bypassing (NopeCHA integration).
- Dynamic Content Support: Selenium-based extraction for JavaScript-heavy elements (agent details, phone numbers).
- Admin dashboard (monitoring + job control): React + FastAPI UI for telemetry, live logs, job termination, analytics, and data exploration.
PropFlux is running as a complete scraping + data pipeline system with: incremental exports (CSV/JSON/SQLite), optional Selenium/NopeCHA dynamic extraction, and a FastAPI + React dashboard for monitoring, job lifecycle control, and analytics. Ongoing work focuses on adding new targets and improving dashboard coverage as more fields are standardized.
- Python 3.11+
- Chrome browser (for Selenium-based extraction)
- See `requirements.txt` for dependencies
```bash
# Clone or navigate to project directory
cd multi-site-real-estate-scraper

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# Scrape Property24 (default settings)
python runner.py --site property24

# Scrape Private Property
python runner.py --site privateproperty

# Hard limit total listings (useful for testing)
python runner.py --site property24 --limit 10

# Skip expensive Selenium dynamic extraction (for rapid testing of Scrapy logic)
python runner.py --site privateproperty --limit 5 --skip-dynamic-fields

# Use a custom URL (search results or single listing)
python runner.py --site property24 --url "https://www.property24.com/for-sale/cape-town/western-cape/432"

# Verbose logging
python runner.py --site property24 --verbose
```

Some sites hide agent details (names, phone numbers) behind JavaScript or CAPTCHAs. This project uses the BrowserService with a persistent Chrome profile to auto-solve CAPTCHAs via the NopeCHA extension.
The `chrome-profiles/` directory is gitignored, so your API key and session data are never committed.
Prerequisites
- A NopeCHA API key (sign up at nopecha.com)
- Set it in your `.env` file: `NOPECHA_API_KEY=your_key_here`
Run the setup command:

```bash
python runner.py --setup-chrome-profile
```

Follow the on-screen instructions to install the extension and authenticate. After setup, all future runs will reuse this profile automatically.
- CSV: `output/<spider>_<timestamp>.csv`
- Finalized JSON: `output/<spider>_<timestamp>.json`
- SQLite: `output/listings.db`
- Job progress snapshots: `output/job_stats/<job_id>.json`
- Logs: `logs/<site>_<job_id>.log`
- Required: `title`, `price`, `location`, `bedrooms`, `bathrooms`, `property_type`, `listing_url`, `description`
- Metadata/flags: `source_site`, `job_id`, `scraped_at`, plus `is_studio`, `is_auction`, `is_private_seller` (when detectable)
- Optional (depends on site selectors): `agent_name`, `agent_phone`, `agency_name`, `listing_id`, `date_posted`, `erf_size`, `floor_size`
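The schema above can be sketched as a dataclass. This is an illustration only; the project's actual item definitions may use Scrapy `Item` classes or plain dicts:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Listing:
    # Required fields
    title: str
    price: Optional[int]
    location: str
    bedrooms: Optional[float]
    bathrooms: Optional[float]
    property_type: str
    listing_url: str
    description: str
    # Metadata/flags
    source_site: str = ""
    job_id: str = ""
    scraped_at: str = ""
    is_studio: Optional[bool] = None
    is_auction: Optional[bool] = None
    is_private_seller: Optional[bool] = None
    # Optional, selector-dependent
    agent_name: Optional[str] = None
    agent_phone: Optional[str] = None
    agency_name: Optional[str] = None
    listing_id: Optional[str] = None
    date_posted: Optional[str] = None
    erf_size: Optional[str] = None
    floor_size: Optional[str] = None
```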
PropFlux is the kind of scraper I build when the goal is not just "get data once", but to create a repeatable pipeline you can operate: run jobs, monitor progress, stop bad runs, and export clean outputs for analysis or downstream systems.
- `runner.py` starts the scrape (Scrapy spider) and records a job in the SQLite-backed `scrape_jobs` table.
- Spiders parse listing pages and emit raw items.
- Pipelines normalize, deduplicate, and export in batches (controlled by `EXPORT_BATCH_SIZE`).
- For dynamic fields, `BrowserService` performs Selenium extraction using a persistent Chrome profile and NopeCHA (under `MAX_CONCURRENT_BROWSERS`, with `RETRY_TIMES`).
- Scrapy updates lightweight progress snapshots in `output/job_stats/<job_id>.json`.
- `api/main.py` exposes telemetry, logs, listings search, and job exports.
- The dashboard provides an operator-friendly UI for starting/terminating jobs and exploring results.
If you're hiring me for web scraping + data pipelines + browser automation, I typically apply this same structure:
- Start with site discovery and selector mapping (Scrapy + dynamic selectors).
- Implement or extend a spider and pipelines so your output schema is consistent and validated.
- Add resilience controls: retries, throttling, proxy strategy, and (when needed) Selenium extraction under a browser concurrency limit.
- Ship monitoring: job lifecycle, telemetry, and log tailing, so you can safely run long scrapes without guessing.
- Provide exports in the formats you need (CSV/JSON/SQLite) and optionally wire them to your downstream system.
The dashboard lives in `dashboard/` and talks to the FastAPI backend.

```bash
# Start the API
python -m uvicorn api.main:app --reload --port 8000

# Start the dashboard
cd dashboard
npm install
npm run dev
```

If your backend is not on `localhost:8000`, set `VITE_API_BASE_URL=http://<host>:<port>`.
- Target site: choose `property24` or `privateproperty`
- Start URL / Search query: optional override (falls back to site defaults)
- Skip dynamic fields: when enabled, the scraper skips Selenium dynamic extraction (faster, but less complete)
- Use engine settings (default OFF): controls whether `settings_overrides` are sent to the API when starting a job.
  - When OFF: the job runs using the current defaults in `scraper/settings.py` and `config/settings.py`
  - When ON: engine sliders apply to the next job you run
- Run job: starts a background scrape and selects the new job in the UI
- Terminate: stops an active job and updates job status/termination timestamps in the database
- Live Console: streams the latest log lines for the selected job
- Recent Jobs: quick selector + job status snapshots (`job_id`, timestamps, item counts)
- Concurrency / domain (`CONCURRENT_REQUESTS_PER_DOMAIN`)
- Download delay (`DOWNLOAD_DELAY`)
- Headless mode (`HEADLESS`)
- Export batch size (`EXPORT_BATCH_SIZE`)
- Max concurrent browsers (`MAX_CONCURRENT_BROWSERS`)
- Retry times (`RETRY_TIMES`)
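The sliders map onto a settings payload along these lines. A hedged sketch: the first three keys are standard Scrapy settings, the values shown are arbitrary examples, and the exact payload shape the API expects is an assumption:

```python
# Hypothetical settings_overrides payload for a job start request.
settings_overrides = {
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,   # Scrapy per-domain concurrency
    "DOWNLOAD_DELAY": 1.5,                 # Scrapy politeness delay (seconds)
    "RETRY_TIMES": 3,                      # Scrapy retry middleware attempts
    "HEADLESS": True,                      # project-specific: browser visibility
    "EXPORT_BATCH_SIZE": 100,              # project-specific: flush interval
    "MAX_CONCURRENT_BROWSERS": 2,          # project-specific: Selenium cap
}
```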
- Analytics: distribution charts and missing-field heatmaps (based on stored listings)
- Data Explorer: searchable, paginated listing grid powered by `/listings/query`
- Filter + pagination over `/jobs/query`
- Per-job exports (CSV or prettified JSON) via `/jobs/{job_id}/export`
- `GET /health`: health check
- `POST /jobs/run`: start a job
- `POST /jobs/{job_id}/terminate`: stop a running job
- `GET /jobs/{job_id}/telemetry`: progress + runtime status
- `GET /jobs/{job_id}/logs?tail=<N>`: live log tail
- `GET /listings/query?limit=&offset=&site=&job_id=&q=`: search + paginate listings
- `GET /jobs/{job_id}/export?format=csv|json`: download results
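A small sketch of client-side URL builders for these endpoints. `API_BASE` assumes the default backend address; the helper names are illustrative, and you would fetch the resulting URLs with any HTTP client:

```python
from urllib.parse import urlencode

API_BASE = "http://localhost:8000"  # assumption: backend on the default port

def logs_url(job_id: str, tail: int = 50) -> str:
    """Build the live log tail URL for a job."""
    return f"{API_BASE}/jobs/{job_id}/logs?{urlencode({'tail': tail})}"

def listings_query_url(limit: int = 20, offset: int = 0, **filters) -> str:
    """Build a /listings/query URL; filters may include site, job_id, q."""
    params = {"limit": limit, "offset": offset,
              **{k: v for k, v in filters.items() if v is not None}}
    return f"{API_BASE}/listings/query?{urlencode(params)}"
```

For example, `requests.get(logs_url(job_id, tail=200))` would poll the latest 200 log lines for a job.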
To add a new target site:

- Add site configuration in `config/sites.yaml` (selectors, pagination strategy, dynamic selectors).
- Create a spider in `scraper/spiders/` that extends the base spider.
- Register the spider in `runner.py` (`SPIDER_MAP`).
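A hypothetical `config/sites.yaml` entry illustrating the three selector groups mentioned above. Every key name and selector here is an assumption; mirror the structure of the existing Property24 and Private Property entries instead:

```yaml
newsite:
  start_url: "https://www.example-portal.co.za/for-sale"
  pagination:
    strategy: "next_link"
    selector: "a.pagination-next::attr(href)"
  selectors:
    title: "h1.listing-title::text"
    price: "span.price::text"
  dynamic_selectors:
    agent_phone: "a.show-number"
```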
Built with ❤️ for reliable, scalable web scraping