LLM-powered tooling for triaging Epstein-related document corpora. Processes PDFs, office documents, and audio/video files through vision-language models, scoring each for investigative significance. Ships a dashboard for filtering, charting, and inspecting scored documents.
Default model: qwen/qwen3-vl-30b-a3b-thinking via OpenRouter (free, hosted by Alibaba). No API costs, no local GPU required.
Dashboard screenshots: Table View, Insights & Charts, Methodology Explainer.
- Install deps:

  ```bash
  pip install -r requirements.txt  # just `requests`
  ```

- Set up your OpenRouter key (free account at openrouter.ai):

  ```bash
  cp .env.template.openrouter .env.openrouter
  # Edit .env.openrouter with your key:
  # OPENROUTER_API_KEY='sk-or-...'
  ```

- Run:

  ```bash
  ./run_ranker.sh --volumes 1            # process volume 1
  ./run_ranker.sh --volumes 1,2,6-8      # multiple volumes
  ./run_ranker.sh --volumes all          # all available volumes
  ./run_ranker.sh --volumes 1 --dry-run  # preview without processing
  ```

That's it. The script auto-loads `.env.openrouter`, connects to OpenRouter, and writes Git-trackable results to `contrib/fta/`.
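The `--volumes` argument accepts single numbers, comma-separated lists, ranges, and `all`. As a rough illustration of how such a spec could be expanded, here is a minimal Python sketch. The function name `parse_volumes` and the `available` parameter are hypothetical and are not taken from the project's code; the actual parsing in `run_ranker.sh` may differ.

```python
from __future__ import annotations


def parse_volumes(spec: str, available: list[int]) -> list[int]:
    """Expand a spec such as "1,2,6-8" or "all" into sorted volume numbers.

    `available` is an assumed list of volumes present on disk; it is only
    consulted when the spec is "all".
    """
    spec = spec.strip()
    if spec.lower() == "all":
        return sorted(available)

    volumes: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            volumes.update(range(start, end + 1))  # ranges are inclusive
        else:
            volumes.add(int(part))
    return sorted(volumes)


if __name__ == "__main__":
    print(parse_volumes("1,2,6-8", available=[1, 2, 3]))
```

Duplicates collapse because the volumes are collected in a set before sorting, so `1,1-2` and `1,2` yield the same list.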
The primary pipeline. Renders PDFs as images, sends them to the VLM for analysis.
```bash
# Cloud only (default)
./run_ranker.sh --volumes 11

# Hybrid: cloud + local fallback
./run_hybrid_volume.sh --volume 11

# Local only
./run_ranker.sh --volumes 11 --provider local --endpoint http://localhost:5555/v1
```

Key options: `--parallel N`, `--max-rows N` (smoke test), `--start-pdf` / `--end-pdf` (split work), `--retry-failed-local`.
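The hybrid mode is described as cloud with a local fallback. The general pattern might look like the Python sketch below. It is an illustration only: `rank_with_fallback` and the two callables are hypothetical stand-ins, and the real `run_hybrid_volume.sh` may decide when to fall back quite differently.

```python
from typing import Callable, Optional, Tuple

# A "ranker" here is any callable that takes a document id and returns a
# score. Both callables are placeholders, not the project's API.
Ranker = Callable[[str], int]


def rank_with_fallback(
    document_id: str,
    cloud: Ranker,
    local: Optional[Ranker] = None,
) -> Tuple[str, int]:
    """Try the cloud ranker first; on any error, retry with the local one.

    Returns a (provider, score) pair so callers can record which backend
    produced the result.
    """
    try:
        return "cloud", cloud(document_id)
    except Exception:
        if local is None:
            raise  # no fallback configured, surface the cloud error
        return "local", local(document_id)


if __name__ == "__main__":
    def flaky_cloud(doc: str) -> int:
        raise TimeoutError("simulated cloud outage")

    print(rank_with_fallback("doc-001", flaky_cloud, lambda doc: 42))
```

Returning the provider name alongside the score keeps results auditable when the two backends serve different model builds.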
Processes Excel, Word, PowerPoint, and other office docs via LibreOffice PDF conversion + the same VLM pipeline.
```bash
./run_office_ranker.sh --volume 11
```

Processes video and audio files. Extracts frames into 2×2 grid composites (4× token savings) and transcribes audio locally via Whisper (small model).

```bash
./run_av_ranker.sh --volume 11                                       # cloud
./run_av_ranker.sh --volume 11 --endpoint http://localhost:5555/v1   # local
./run_av_ranker.sh --volume 11 --max-files 3                         # smoke test
./run_av_ranker.sh --volume 11 --no-transcription                    # frames only
```

Key options: `--fps N`, `--max-frames N`, `--grid-cols/rows N`, `--no-grid`, `--whisper-model SIZE`.
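To see where the 4× saving comes from, the sketch below groups frame indices into grid composites and reports how many images would be sent. It only plans the layout and does no image processing. The names `plan_grids` and `cell_position` are hypothetical; `av_ranker.py` may organize this differently.

```python
from __future__ import annotations


def plan_grids(frame_count: int, cols: int = 2, rows: int = 2) -> list[list[int]]:
    """Group frame indices into composites of up to cols * rows frames each.

    The last composite may be partially filled when frame_count is not a
    multiple of the grid size.
    """
    if cols < 1 or rows < 1:
        raise ValueError("cols and rows must be at least 1")
    per_grid = cols * rows
    return [
        list(range(start, min(start + per_grid, frame_count)))
        for start in range(0, frame_count, per_grid)
    ]


def cell_position(slot: int, cols: int = 2) -> tuple[int, int]:
    """Return the (row, column) of a frame's slot within its composite."""
    return divmod(slot, cols)


if __name__ == "__main__":
    grids = plan_grids(10)
    print(f"{len(grids)} composite images instead of 10 frames: {grids}")
```

With the default 2×2 grid, 10 frames become 3 composites, so the number of images sent to the model drops by roughly the grid size.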
All pipelines support local inference via LM Studio or any OpenAI-compatible server:
```bash
# Serve qwen/qwen3-vl-30b-a3b-thinking in LM Studio on port 5555, then:
./run_ranker.sh --volumes 1 --provider local --endpoint http://localhost:5555/v1
```

OpenRouter environment variables:

```bash
OPENROUTER_API_KEY='sk-or-...'
OPENROUTER_REFERER='https://epsteingate.org'
OPENROUTER_TITLE='Epstein File Ranker'
OPENROUTER_PROVIDER='alibaba'
OPENROUTER_NO_FALLBACKS='1'
```

Copy `ranker_config.example.toml` and customize. CLI flags override TOML values.
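The precedence rule (CLI flags over TOML values) can be pictured as a layered merge. The Python sketch below assumes the TOML file has already been parsed into a dict and that unset CLI flags arrive as `None`. The function `resolve_config` and the keys `parallel` and `provider` are illustrative assumptions, not the project's actual config schema.

```python
from __future__ import annotations

from typing import Any


def resolve_config(
    defaults: dict[str, Any],
    toml_values: dict[str, Any],
    cli_values: dict[str, Any],
) -> dict[str, Any]:
    """Merge settings so that CLI flags beat TOML values, which beat defaults.

    CLI entries set to None are treated as "flag not given" and ignored.
    """
    resolved = dict(defaults)
    resolved.update(toml_values)
    resolved.update({k: v for k, v in cli_values.items() if v is not None})
    return resolved


if __name__ == "__main__":
    config = resolve_config(
        defaults={"parallel": 1, "provider": "openrouter"},
        toml_values={"parallel": 4},
        cli_values={"parallel": None, "provider": "local"},
    )
    print(config)
```

Filtering on `None` rather than on falsiness matters here: a flag explicitly set to `0` or an empty string still overrides the TOML value.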
```bash
python gpt_ranker.py --prompt-file prompts/my_custom_prompt.txt
```

See `prompts/README.md` for details.
- DOJ FTA corpus (primary): Epstein-Files GitHub, with raw PDFs and multimedia under `data/new_data/VOL00001...`
- StandardWorks index: standardworks.ai/epstein-files
- Legacy OCR dataset: tensonaut/EPSTEIN_FILES_20K
For detailed information on how these files are organized (e.g., NATIVES vs OCR), please see the Data Directory README.
```bash
./viewer.sh 9000  # or: cd viewer && python -m http.server 9000
```

Open http://localhost:9000 for an AG Grid table with score filtering, charts, power mention analysis, and full document text inspection.
| File | Purpose |
|---|---|
| `gpt_ranker.py` | Main orchestration pipeline |
| `av_ranker.py` | Audio/video processing pipeline |
| `office_ranker.py` | Office document processing pipeline |
| `ranker/cli.py` | CLI parsing + config resolution |
| `ranker/model_client.py` | API client, retries, vision request building |
| `ranker/constants.py` | Canonical maps and shared constants |
| `run_ranker.sh` | PDF processing wrapper (cloud by default) |
| `run_hybrid_volume.sh` | Hybrid cloud+local processing wrapper |
| `run_av_ranker.sh` | AV processing wrapper |
| `run_office_ranker.sh` | Office document processing wrapper |
Documents are scored 0–100 based on investigative significance:
| Range | Meaning |
|---|---|
| 0–10 | Noise, duplicates, no actionable info |
| 10–30 | Weak leads, speculative |
| 30–50 | Moderate leads, partial details |
| 50–70 | Strong leads, actionable info |
| 70–85 | High-impact revelations |
| 85–100 | Blockbuster evidence |
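For downstream filtering, a score can be mapped back to its band with a small lookup. The table's ranges share their endpoints (10 appears in two rows), so this Python sketch assumes each lower bound is inclusive and each upper bound exclusive, with 100 falling in the top band. That boundary convention and the name `score_band` are assumptions, not something the project specifies.

```python
# (exclusive upper bound, label) pairs, in ascending order.
# The final bound is 101 so that a score of exactly 100 lands in the top band.
BANDS = [
    (10, "Noise, duplicates, no actionable info"),
    (30, "Weak leads, speculative"),
    (50, "Moderate leads, partial details"),
    (70, "Strong leads, actionable info"),
    (85, "High-impact revelations"),
    (101, "Blockbuster evidence"),
]


def score_band(score: float) -> str:
    """Return the band label for a score in the range 0 to 100 inclusive."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for upper, label in BANDS:
        if score < upper:
            return label
    raise AssertionError("unreachable: every valid score is below 101")


if __name__ == "__main__":
    for value in (0, 10, 69.5, 100):
        print(value, "->", score_band(value))
```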
The corpus contains sensitive content (abuse, trafficking, violence, unverified allegations). Scores prioritize leads for human review — this project does not assert the veracity of any individual document.
This project is licensed under CC BY-SA 4.0 (Creative Commons Attribution-ShareAlike 4.0 International). You must give appropriate credit and distribute derivative works under the same license.


