NexusCrawl

An asynchronous, dual-transmission data mining, historical archival, and web reconnaissance engine. Built for civic audits, large-scale dataset extraction, deep media preservation, and offline AI-driven intelligence structuring.

Core Architecture

NexusCrawl utilizes a highly concurrent, hybrid event loop:

HTTPX (Standard Routing): High-speed, low-overhead async requests for static DOM parsing and HTTP HEAD reconnaissance.
Playwright (Heavy Routing): Headless Chromium integration for extracting JavaScript-rendered (React/Angular/Vue) data tables, interactive DOM elements, and executing client-side scripts before extraction.

Resiliency & Data Pipelines

Exponential Backoff Shield: A built-in RetryMiddleware that intercepts HTTP 429 (Rate Limit) and HTTP 403 (Forbidden) server drops, pauses the specific worker, and gracefully retries the connection without killing the primary crawl.
Asynchronous File Streaming: Utilizes aiofiles to prevent desktop RAM bottlenecks. Data is streamed directly to disk whether it is a .jsonl dictionary string, a cloned .css file, or a massive binary.
Dual-Routing SQL Exporter: Automatically routes extracted datasets into a local nexus_database.db SQLite database. It handles both raw web payloads (JSON row data) and refined intelligence models (parsed budget lines and meeting votes) simultaneously.
Structural PDF Exploiter & OCR Fallback: An offline, regex-hardened parser (pdfplumber) that rips tabular financial data from digital PDFs. If a document is a scanned "ghost" image, it automatically routes the file through a local Tesseract/Poppler OCR pipeline to force text extraction.
Offline LLM Structuring: Uses local, offline AI models (via Ollama) together with instructor to read noisy OCR text dumps and reconstruct them into schema-validated, structured Pydantic models (e.g., identifying specific parliamentary votes, motions, and financial impacts).
Stateful AI Checkpointing: Built-in SQLite state tracking ensures that long-running, CPU-bound LLM operations never repeat work. If the pipeline is paused or encounters a fatal edge case, it resumes exactly where it left off, bypassing previously completed pages.
Stream Interceptor: Offloads HLS/Blob streams to a background yt-dlp threading pipeline, automatically utilizing FFmpeg to decrypt and stitch streaming video chunks into native .mp4 files.
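
To illustrate the retry behavior described under Exponential Backoff Shield, here is a minimal sketch of exponential backoff with jitter. It is not the project's RetryMiddleware: the function name `fetch_with_backoff`, its parameters, and the `send` callable (any zero-argument coroutine function returning an object with a `status_code` attribute) are assumptions made for this example.

```python
import asyncio
import random

# Status codes the README says the retry layer intercepts.
RETRYABLE_STATUSES = {429, 403}


async def fetch_with_backoff(send, max_retries=5, base_delay=1.0,
                             max_delay=60.0, sleep=asyncio.sleep):
    """Call `send()` until it returns a non-retryable status.

    `send` is a hypothetical zero-argument coroutine function that
    returns an object exposing `.status_code`.
    """
    attempt = 0
    while True:
        response = await send()
        # Return on success, or give up once the retry budget is spent.
        if response.status_code not in RETRYABLE_STATUSES or attempt >= max_retries:
            return response
        # Delay doubles per attempt (1s, 2s, 4s, ...) up to max_delay.
        delay = min(max_delay, base_delay * 2 ** attempt)
        # Up to 10% jitter so parallel workers do not retry in lockstep.
        delay += random.uniform(0, 0.1 * delay)
        # Only this coroutine sleeps; other workers keep crawling.
        await sleep(delay)
        attempt += 1
```

Because the wait happens with `await`, only the worker that hit the rate limit is paused, which matches the "without killing the primary crawl" behavior described above.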

The Spider Matrix

| CLI Name | Target File | Primary Function | Rendering Engine | Output Pipeline |
| --- | --- | --- | --- | --- |
| `foia_hunter` | `spiders/civic_spider.py` | Deep-crawling, recursive pagination, and async `HEAD` probes to discover and extract hidden government documents (PDF, CSV, ZIP). | HTTPX | `AsyncMediaPipeline` |
| `table_miner` | `spiders/table_spider.py` | Takes control of the browser to flatten complex, paginated 2D JS data grids into structured dictionaries. | Playwright | `JsonLinesPipeline` |
| `media_archive` | `spiders/video_spider.py` | Two-phase deep driller. Scans directories, queues watch pages, and streams high-res `.mp4` / `.webm` binaries. | HTTPX | `AsyncMediaPipeline` |
| `web_recon` | `spiders/recon_spider.py` | Navigates to a target, executes client-side scripts, and clones the rendered HTML, CSS, and JS into a local repository. | Playwright + HTTPX | `SourceCodePipeline` |
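
The matrix above can be read as a lookup from spider name to rendering engine. The sketch below encodes only that mapping. The `Engine` flag, the `SPIDER_ENGINES` dictionary, and `needs_browser` are illustrative names, not identifiers from the NexusCrawl codebase.

```python
from enum import Flag, auto


class Engine(Flag):
    """Rendering engines named in the Spider Matrix."""
    HTTPX = auto()
    PLAYWRIGHT = auto()


# Spider name -> engine(s), copied from the matrix.
SPIDER_ENGINES = {
    "foia_hunter": Engine.HTTPX,
    "table_miner": Engine.PLAYWRIGHT,
    "media_archive": Engine.HTTPX,
    "web_recon": Engine.PLAYWRIGHT | Engine.HTTPX,
}


def needs_browser(spider_name: str) -> bool:
    """Return True if the spider relies on headless Chromium."""
    return bool(SPIDER_ENGINES[spider_name] & Engine.PLAYWRIGHT)
```

A check like `needs_browser("table_miner")` could, for example, decide whether Chromium has to be launched before a crawl starts.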

Installation & Setup

1. Install Python Dependencies

pip install -r requirements.txt

2. Install Headless Chromium & FFmpeg

Required for Playwright rendering and media stream stitching. Run these in an Administrator PowerShell:

playwright install chromium
winget install Gyan.FFmpeg

3. Install the OCR Engine (Tesseract & Poppler)

To process scanned, non-digital PDFs, NexusCrawl requires Tesseract and Poppler binaries locally.

Global Tesseract Install (Admin PowerShell)

winget install -e --id UB-Mannheim.TesseractOCR

Local Poppler Setup

Download the latest Poppler Windows release zip from oschwartz10612/poppler-windows.
Extract the core folder directly into the root of this repository and rename it to poppler.
Ensure the path poppler/Library/bin/pdftoppm.exe exists.

4. Install Offline AI Engine (Ollama)

Required for zero-cost, localized NLP and structured data extraction.

irm https://ollama.com/install.ps1 | iex
ollama run llama3.2
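
Before the first crawl, it can help to confirm that the external tools from steps 2 to 4 are reachable. The following is a hedged sketch, not a script shipped with the repository: `check_prerequisites` is a hypothetical helper, and it assumes the executables are named `ffmpeg`, `tesseract`, and `ollama` on the PATH. The Poppler path is the one given in step 3.

```python
import shutil
from pathlib import Path


def check_prerequisites(repo_root="."):
    """Report which external tools appear to be available.

    A False value only means the tool was not found on PATH (or, for
    Poppler, at the repo-local path); it may still be installed elsewhere.
    """
    root = Path(repo_root)
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "tesseract": shutil.which("tesseract") is not None,
        "ollama": shutil.which("ollama") is not None,
        # Path taken from the Local Poppler Setup step.
        "poppler": (root / "poppler" / "Library" / "bin" / "pdftoppm.exe").exists(),
    }
```

Calling `check_prerequisites()` from the repository root returns a dictionary of booleans, one per tool, that can be printed or used to abort early.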

Execution Commands

NexusCrawl is driven entirely via the CLI using main.py and modular scripts.

The Crawler Operations

Run a spider on its default hardcoded target:

python main.py --spider table_miner

Override the default target with a custom URL:

python main.py --spider foia_hunter --url https://gilescountytn.gov/commission-minutes/
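
For readers adapting the CLI, the two flags used above could be parsed roughly as follows. This is a sketch under assumptions: the real main.py may define its arguments differently, and `build_parser` and `parse_cli` are names invented for this example. Only the `--spider` and `--url` flags and the four spider names come from this README.

```python
import argparse

# Spider names as listed in the Spider Matrix.
SPIDERS = ("foia_hunter", "table_miner", "media_archive", "web_recon")


def build_parser():
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Run a NexusCrawl spider (illustrative parser).",
    )
    parser.add_argument("--spider", required=True, choices=SPIDERS,
                        help="which spider to run")
    parser.add_argument("--url", default=None,
                        help="optional URL overriding the spider's default target")
    return parser


def parse_cli(argv=None):
    """Parse `argv` (defaults to sys.argv[1:]) into a namespace."""
    return build_parser().parse_args(argv)
```

With this shape, `parse_cli(["--spider", "table_miner"])` leaves `url` as `None`, which the caller can treat as "use the hardcoded default target".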

The Intelligence Operations

Note: The following scripts are provided as highly specialized examples of how to query and structure the nexus_database.db vault. They are currently configured for civic audits (extracting budgets, roll-call votes, and parliamentary motions). However, because the core NexusCrawl engine structures raw data into an offline SQLite database, you can easily rewrite these intelligence scripts to audit any domain (e.g., corporate ledgers, medical abstracts, supply chain manifests, etc.) by simply changing the SQL queries and LLM prompts.

1. Extract Raw Intelligence & Budgets from PDFs

Run the mass-exploitation parser across all downloaded PDFs to extract raw text and structured financial ledgers into the local database.

python parsers/pdf_parser.py

2. Multi-Vector Intelligence Sweeps

Execute a regex-backed SQL hunt across the entire OCR database for specific keywords or aliases (e.g., tracking down hidden tax hikes or specific town funding).

python scripts/search_intel.py --keywords "wheel tax" "vehicle privilege tax" "registration fee"
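
A regex-backed SQL search of this kind can be sketched with the standard library alone. The example below is not the contents of search_intel.py. It assumes the `extracted_text` table has `source_file`, `page`, and `content` columns (hypothetical column names), and `search_keywords` is an illustrative function.

```python
import re
import sqlite3


def search_keywords(conn, keywords):
    """Return (source_file, page, content) rows matching any keyword.

    Matching is case-insensitive. Keywords are escaped, so they are
    treated as literal phrases rather than regex syntax.
    """
    if not keywords:
        return []
    pattern = "|".join(re.escape(k) for k in keywords)

    def regexp(expr, item):
        # SQLite evaluates `X REGEXP Y` as regexp(Y, X): pattern first.
        if item is None:
            return 0
        return int(re.search(expr, item, re.IGNORECASE) is not None)

    # SQLite has no built-in REGEXP; register one for this connection.
    conn.create_function("REGEXP", 2, regexp)
    cursor = conn.execute(
        "SELECT source_file, page, content FROM extracted_text "
        "WHERE content REGEXP ? ORDER BY source_file, page",
        (pattern,),
    )
    return cursor.fetchall()
```

Registering `REGEXP` as a user function keeps the filtering inside the SQL query, so only matching rows are returned to Python.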

3. Detonate the NLP Nuke (Structured AI Extraction)

Feed the raw OCR text into the local llama3.2 model to extract structured parliamentary roll-call votes. The engine automatically skips previously processed pages.

# Run a targeted extraction on a specific page
python scripts/nlp_nuke.py --file "20Oct25_22c877.pdf" --page 59

# Execute a mass extraction across all unprocessed targets
python scripts/nlp_nuke.py
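
The skip-if-already-processed behavior can be sketched as a small SQLite checkpoint table. This is an assumption-laden illustration rather than the code in nlp_nuke.py: the table name `nlp_checkpoints`, its columns, and all function names here are hypothetical, and `extract` stands in for the LLM call.

```python
import sqlite3


def init_checkpoints(conn):
    """Create the (hypothetical) checkpoint table if it is missing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nlp_checkpoints ("
        "source_file TEXT NOT NULL, page INTEGER NOT NULL, "
        "PRIMARY KEY (source_file, page))"
    )
    conn.commit()


def is_done(conn, source_file, page):
    row = conn.execute(
        "SELECT 1 FROM nlp_checkpoints WHERE source_file = ? AND page = ?",
        (source_file, page),
    ).fetchone()
    return row is not None


def mark_done(conn, source_file, page):
    conn.execute(
        "INSERT OR IGNORE INTO nlp_checkpoints (source_file, page) VALUES (?, ?)",
        (source_file, page),
    )
    conn.commit()  # commit per page so a crash loses at most one page


def process_pending(conn, targets, extract):
    """Run `extract(source_file, page)` on every target not yet done."""
    processed = []
    for source_file, page in targets:
        if is_done(conn, source_file, page):
            continue  # finished in an earlier run
        extract(source_file, page)
        mark_done(conn, source_file, page)
        processed.append((source_file, page))
    return processed
```

Because a page is marked done only after `extract` returns, an interrupted run leaves the failing page unmarked and the next run picks it up again.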

4. Generate an Executive Audit Briefing

Command the offline AI to synthesize the raw parliamentary data into a high-level summary of overarching votes, financial highlights, and contracts.

# Summarize a specific document
python scripts/intel_summary.py --file "Minutes_3b8527.pdf"

# Generate a global briefing
python scripts/intel_summary.py

5. Export Intelligence to CSV

python scripts/export_csv.py
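
For a custom export, a table-to-CSV dump can be written with `sqlite3` and `csv` from the standard library. The sketch below is not export_csv.py itself: `export_table_to_csv` is a hypothetical helper, and which tables the real script exports is not specified here.

```python
import csv
import sqlite3


def export_table_to_csv(conn, table, out_path):
    """Write every row of `table` to `out_path`; return the row count."""
    # Check the name against the schema, because table names cannot be
    # passed as bound SQL parameters.
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table!r}")

    cursor = conn.execute(f'SELECT * FROM "{table}"')
    headers = [col[0] for col in cursor.description]
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(headers)
        for row in cursor:  # iterate lazily instead of loading all rows
            writer.writerow(row)
            count += 1
    return count
```

For example, `export_table_to_csv(sqlite3.connect("nexus_database.db"), "budget_items", "budget_items.csv")` would dump the `budget_items` table named under Data Output Structure, assuming that table exists in the local database.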

Data Output Structure

/nexus_database.db
/parsed_intel.db
/civic_audit_data.jsonl
/media/
/recon_vault/

`/nexus_database.db`

Relational SQLite database containing structured, queryable extractions:

civic_records & table_records (Live crawler payloads)
budget_items & meeting_votes (Refined intelligence extracted from PDFs)

`/parsed_intel.db`

Secondary SQLite database housing bulk analytical data:

extracted_tables (Raw tabular matrices wrapped in JSON)
extracted_text (Raw, searchable paragraph text)

`/media/`

Stores downloaded binary files (Images, Videos, PDFs).

`/recon_vault/`

Cloned website source code organized by target domain and file type.

Operational Use Cases

1. The Fiscal Audit (FOIA Hunter + PDF Exploiter)

Objective: Compare year-over-year county budget allocations.

Execution:

Run foia_hunter against a government portal.
Execute pdf_parser.py to extract financial data.
Export to CSV for visualization.

2. The Parliamentary Extractor (The NLP Nuke)

Objective: Track voting records across meeting minutes.

Execution:

Extract raw text into parsed_intel.db.
Run nlp_nuke.py with Ollama.
Output structured JSON of motions and votes.
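
To give a sense of what "structured JSON of motions and votes" might look like, here is a standard-library sketch. It deliberately swaps in `dataclasses` for the Pydantic models the project uses, so it runs without extra dependencies. The `MotionRecord` fields, the allowed vote values, and the JSON layout are all assumptions, not the schema in models.py.

```python
import json
from collections import Counter
from dataclasses import dataclass

# Hypothetical set of allowed vote values.
VALID_VOTES = {"yes", "no", "abstain", "absent"}


@dataclass
class MotionRecord:
    motion: str
    votes: dict   # member name -> one of VALID_VOTES
    passed: bool

    def __post_init__(self):
        # Reject anything the model produced outside the allowed set.
        bad = {m: v for m, v in self.votes.items() if v not in VALID_VOTES}
        if bad:
            raise ValueError(f"invalid vote values: {bad}")

    def tally(self):
        """Count votes per value, e.g. {'yes': 3, 'no': 1}."""
        return dict(Counter(self.votes.values()))


def parse_motion(raw_json: str) -> MotionRecord:
    """Validate one JSON object (e.g. LLM output) into a MotionRecord."""
    data = json.loads(raw_json)
    return MotionRecord(
        motion=str(data["motion"]),
        votes=dict(data["votes"]),
        passed=bool(data["passed"]),
    )
```

Validating at the boundary means malformed model output fails loudly instead of being written to the database.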

3. The Executive Synthesizer (Intel Summary)

Objective: Generate high-level summaries of operations and finances.

Execution:

Run intel_summary.py.
Output Executive_Audit_Briefing.md.

4. The Digital Preservation (Web Recon + Media Archive)

Objective: Clone infrastructure and archive media before removal.

Execution:

Run web_recon for site cloning.
Run media_archive for stream capture and .mp4 generation.
