An asynchronous, dual-transmission data mining, historical archival, and web reconnaissance engine. Built for civic audits, large-scale dataset extraction, deep media preservation, and offline AI-driven intelligence structuring.
NexusCrawl utilizes a highly concurrent, hybrid event loop:
- **HTTPX (Standard Routing):** High-speed, low-overhead async requests for static DOM parsing and HTTP `HEAD` reconnaissance.
- **Playwright (Heavy Routing):** Headless Chromium integration for extracting JavaScript-rendered (React/Angular/Vue) data tables, interacting with DOM elements, and executing client-side scripts before extraction.
- **Exponential Backoff Shield:** A built-in `RetryMiddleware` that intercepts HTTP `429` (Rate Limit) and HTTP `403` (Forbidden) server drops, pauses the specific worker, and gracefully retries the connection without killing the primary crawl.
- **Asynchronous File Streaming:** Utilizes `aiofiles` to prevent desktop RAM bottlenecks. Data is streamed directly to disk whether it is a `.jsonl` dictionary string, a cloned `.css` file, or a massive binary.
- **Dual-Routing SQL Exporter:** Automatically routes extracted datasets into a local `nexus_database.db` SQLite database. It handles both raw web payloads (JSON row data) and refined intelligence models (parsed budget lines and meeting votes) simultaneously.
- **Structural PDF Exploiter & OCR Fallback:** An offline, regex-hardened parser (`pdfplumber`) that rips tabular financial data from digital PDFs. If a document is a scanned "ghost" image, it automatically routes the file through a local Tesseract/Poppler OCR pipeline to force text extraction.
- **Offline LLM Structuring:** Utilizes local, offline AI models (via Ollama) and `instructor` to read chaotic OCR text dumps and reconstruct them into clean, validated Pydantic models (e.g., identifying specific parliamentary votes, motions, and financial impacts).
- **Stateful AI Checkpointing:** Built-in SQLite state tracking ensures that long-running, CPU-bound LLM operations never repeat work. If the pipeline is paused or encounters a fatal edge case, it resumes exactly where it left off, skipping previously completed pages.
- **Stream Interceptor:** Offloads HLS/Blob streams to a background `yt-dlp` threading pipeline, automatically utilizing FFmpeg to decrypt and stitch streaming video chunks into native `.mp4` files.
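The backoff shield's behavior can be sketched roughly as follows. This is a simplified illustration, not the actual `RetryMiddleware`: the `fetch` callable and the `RetryableStatus` exception are stand-ins for the real request layer.

```python
import asyncio
import random

RETRYABLE = {429, 403}  # Rate Limit / Forbidden

class RetryableStatus(Exception):
    """Stand-in for a response carrying a retryable HTTP status."""
    def __init__(self, status: int):
        super().__init__(status)
        self.status = status

async def fetch_with_backoff(fetch, url: str, max_retries: int = 5, base_delay: float = 1.0):
    """Retry one worker's request with exponential backoff plus jitter,
    without tearing down the rest of the crawl."""
    for attempt in range(max_retries):
        try:
            return await fetch(url)
        except RetryableStatus as exc:
            if exc.status not in RETRYABLE or attempt == max_retries - 1:
                raise  # non-retryable status, or retries exhausted
            # Pause only this worker: base, 2x, 4x, ... plus jitter.
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay / 2))
```

Because only the failing worker sleeps, the rest of the event loop keeps crawling at full speed.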
| CLI Name | Target File | Primary Function | Rendering Engine | Output Pipeline |
|---|---|---|---|---|
| `foia_hunter` | `spiders/civic_spider.py` | Deep-crawling, recursive pagination, and async HEAD probes to discover and extract hidden government documents (PDF, CSV, ZIP). | HTTPX | AsyncMediaPipeline |
| `table_miner` | `spiders/table_spider.py` | Takes control of the browser to flatten complex, paginated 2D JS data grids into structured dictionaries. | Playwright | JsonLinesPipeline |
| `media_archive` | `spiders/video_spider.py` | Two-phase deep driller: scans directories, queues watch pages, and streams high-res `.mp4` / `.webm` binaries. | HTTPX | AsyncMediaPipeline |
| `web_recon` | `spiders/recon_spider.py` | Navigates to a target, executes client-side scripts, and clones the rendered HTML, CSS, and JS into a local repository. | Playwright + HTTPX | SourceCodePipeline |
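As a rough sketch of the HEAD-probe filtering `foia_hunter` performs: the function names below are illustrative, and `head` stands in for an async HTTP HEAD call that returns the response's Content-Type header.

```python
import asyncio

# MIME types treated as downloadable documents (PDF, CSV, ZIP per the table above).
DOC_TYPES = {"application/pdf", "text/csv", "application/zip"}

async def probe_documents(head, urls):
    """Fire concurrent HEAD probes and keep only URLs resolving to document payloads."""
    async def probe(url):
        content_type = await head(url)  # e.g. response.headers["Content-Type"]
        return url if content_type.split(";")[0].strip() in DOC_TYPES else None

    results = await asyncio.gather(*(probe(u) for u in urls))
    return [u for u in results if u is not None]
```

HEAD probes avoid downloading full pages just to learn a link points at HTML rather than a document, which keeps discovery passes cheap.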
```
pip install -r requirements.txt
```

Required for Playwright rendering and media stream stitching. Run these in an Administrator PowerShell:

```
playwright install chromium
winget install Gyan.FFmpeg
```

To process scanned, non-digital PDFs, NexusCrawl requires Tesseract and Poppler binaries locally.

```
winget install -e --id UB-Mannheim.TesseractOCR
```

- Download the latest Poppler Windows release zip from `oschwartz10612/poppler-windows`.
- Extract the core folder directly into the root of this repository and rename it to `poppler`.
- Ensure the path `poppler/Library/bin/pdftoppm.exe` exists.
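A small sanity check (a convenience sketch, not part of the repository; the paths mirror the layout described above) can confirm the OCR prerequisites are in place:

```python
import shutil
from pathlib import Path

def missing_ocr_prereqs(repo_root: Path) -> list[str]:
    """Return the names of OCR dependencies that are not yet installed."""
    missing = []
    # Poppler must be extracted into the repo root, as described above.
    if not (repo_root / "poppler" / "Library" / "bin" / "pdftoppm.exe").exists():
        missing.append("poppler")
    # Tesseract is expected to be resolvable on PATH.
    if shutil.which("tesseract") is None:
        missing.append("tesseract")
    return missing
```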
Required for zero-cost, localized NLP and structured data extraction.

```
irm https://ollama.com/install.ps1 | iex
ollama run llama3.2
```

NexusCrawl is driven entirely via the CLI using `main.py` and modular scripts.
Run a spider on its default hardcoded target:

```
python main.py --spider table_miner
```

Override the default target with a custom URL:

```
python main.py --spider foia_hunter --url https://gilescountytn.gov/commission-minutes/
```

> Note: The following scripts are provided as highly specialized examples of how to query and structure the `nexus_database.db` vault. They are currently configured for civic audits (extracting budgets, roll-call votes, and parliamentary motions). However, because the core NexusCrawl engine structures raw data into an offline SQLite database, these intelligence scripts can be rewritten to audit any domain (e.g., corporate ledgers, medical abstracts, supply chain manifests) simply by changing the SQL queries and LLM prompts.
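To adapt the queries, it helps to see their basic shape. Below is a minimal sketch of a keyword hunt over the `extracted_text` table; the column names `source_file`, `page`, and `text` are assumptions for illustration and should be matched to the actual schema.

```python
import sqlite3

def keyword_hunt(db_path: str, keywords: list[str]) -> list[tuple]:
    """Case-insensitive LIKE search across stored OCR text."""
    clause = " OR ".join("text LIKE ?" for _ in keywords)
    params = [f"%{kw}%" for kw in keywords]
    conn = sqlite3.connect(db_path)
    try:
        # Column names are illustrative; adjust to the real schema.
        return conn.execute(
            f"SELECT source_file, page, text FROM extracted_text WHERE {clause}",
            params,
        ).fetchall()
    finally:
        conn.close()
```

SQLite's `LIKE` is case-insensitive for ASCII by default, which is convenient for noisy OCR output.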
Run the mass-exploitation parser across all downloaded PDFs to extract raw text and structured financial ledgers into the local database:

```
python parsers/pdf_parser.py
```

Execute a regex-backed SQL hunt across the entire OCR database for specific keywords or aliases (e.g., tracking down hidden tax hikes or specific town funding):

```
python scripts/search_intel.py --keywords "wheel tax" "vehicle privilege tax" "registration fee"
```

Feed the raw OCR text into the local llama3.2 model to extract structured parliamentary roll-call votes. The engine automatically skips previously processed pages.
```
# Run a targeted extraction on a specific page
python scripts/nlp_nuke.py --file "20Oct25_22c877.pdf" --page 59

# Execute a mass extraction across all unprocessed targets
python scripts/nlp_nuke.py
```

Command the offline AI to synthesize the raw parliamentary data into a high-level summary of overarching votes, financial highlights, and contracts.
```
# Summarize a specific document
python scripts/intel_summary.py --file "Minutes_3b8527.pdf"

# Generate a global briefing
python scripts/intel_summary.py
```

Export the structured database tables to CSV:

```
python scripts/export_csv.py
```

- `/nexus_database.db`: Relational SQLite database containing structured, queryable extractions:
  - `civic_records` & `table_records` (Live crawler payloads)
  - `budget_items` & `meeting_votes` (Refined intelligence extracted from PDFs)
- `/parsed_intel.db`: Secondary SQLite database housing bulk analytical data:
  - `extracted_tables` (Raw tabular matrices wrapped in JSON)
  - `extracted_text` (Raw, searchable paragraph text)
- `/civic_audit_data.jsonl`: JSON Lines output from the crawler pipelines.
- `/media/`: Stores downloaded binary files (Images, Videos, PDFs).
- `/recon_vault/`: Cloned website source code organized by target domain and file type.
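The export step above can be sketched in a few lines of stdlib Python; this assumes only that the named table exists in the database layout described above, and `export_csv.py`'s actual behavior may differ.

```python
import csv
import sqlite3

def export_table_to_csv(db_path: str, table: str, out_path: str) -> int:
    """Dump one SQLite table to a CSV file with a header row; returns the row count."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(f'SELECT * FROM "{table}"')  # table name must be trusted input
        headers = [col[0] for col in cursor.description]
        rows = cursor.fetchall()
    finally:
        conn.close()
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(headers)
        writer.writerows(rows)
    return len(rows)
```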
**Objective:** Compare year-over-year county budget allocations.

**Execution:**
- Run `foia_hunter` against a government portal.
- Execute `pdf_parser.py` to extract financial data.
- Export to CSV for visualization.
**Objective:** Track voting records across meeting minutes.

**Execution:**
- Extract raw text into `parsed_intel.db`.
- Run `nlp_nuke.py` with Ollama.
- Output structured JSON of motions and votes.
**Objective:** Generate high-level summaries of operations and finances.

**Execution:**
- Run `intel_summary.py`.
- Output `Executive_Audit_Briefing.md`.
**Objective:** Clone infrastructure and archive media before removal.

**Execution:**
- Run `web_recon` for site cloning.
- Run `media_archive` for stream capture and `.mp4` generation.
