A deterministic claim extraction and evidence discovery engine.
Veritas extracts claims from audio, video, text, and PDF sources using NLP — no large language models — then discovers supporting evidence from 29 free structured APIs using rule-based methods. Every claim maps to primary sources with full scoring transparency.
Zero LLM dependency. Zero hallucination risk. Built for podcasters, journalists, and researchers who need to verify what was said.
Veritas takes any audio, video, text, or PDF source and runs it through a fully deterministic pipeline:
- Ingest — downloads audio via yt-dlp, reads text/PDF files, fetches web articles (trafilatura extraction), or pulls YouTube captions instantly without GPU
- Transcribe — GPU-accelerated speech-to-text using faster-whisper (CTranslate2 / CUDA), or direct text-to-segment conversion for document intake. YouTube captions bypass transcription entirely.
- Extract Claims — rule-based NLP identifies checkable factual statements from the transcript. No LLM, no prompt engineering — uses sentence boundary detection, named entity recognition, assertion verb patterns, and signal scoring. Optional spaCy enhancement for better sentence splitting, NER, and subject detection
- Categorize — context-aware keyword classification across 12 categories routes each claim to the most relevant evidence sources. Source metadata (title, channel) influences categorization so claims inherit context from their source
- Verify — smart routing with 15 content-aware signals sends claims to free, structured APIs. BM25Okapi scoring with entity matching, number matching (±5% tolerance), keyphrase alignment, temporal awareness, and evidence type weighting. Query variant fallback retries with alternative queries when primary returns nothing. Strict guardrails prevent false positives
- Cluster — a knowledge graph fingerprints claims, groups them by category and numeric/entity content, then clusters related claims across sources using NetworkX Louvain community detection (falls back to Union-Find if NetworkX unavailable). PageRank centrality selects cluster representatives. Cross-source clusters get consensus scoring (strong/moderate/weak/insufficient). D3-compatible JSON export for frontend visualization
The result: a structured database of claims, each linked to candidate evidence with full scoring transparency, plus a knowledge graph showing how claims relate across sources.
Veritas takes a fundamentally different approach from LLM-based fact-checkers:
- Extraction is deterministic — the same transcript always produces the same claims. No temperature, no sampling, no prompt sensitivity
- Verification is rule-based — scoring functions use BM25Okapi textual relevance, entity matching, number matching with ±5% tolerance, and evidence type classification. No embeddings, no semantic similarity
- Evidence comes from primary sources — SEC filings, academic papers, government datasets, market data, fact-check organizations. Not web search, not LLM-generated summaries
- Temporal awareness — claims with dates are matched against time-relevant evidence; stale data is penalized
- Cross-source consensus — when multiple independent sources agree, confidence increases. Consensus scoring analyzes agreement across all evidence
- Unknown is the default — if the evidence APIs return nothing relevant, the claim stays UNKNOWN. Veritas never guesses
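The ±5% number-tolerance rule above can be sketched in a few lines. This is an illustrative reimplementation, not the actual Veritas scoring code; the function name and regex are assumptions:

```python
import re

def numbers_match(claim_text: str, evidence_text: str, tolerance: float = 0.05) -> bool:
    """Return True if any number in the claim appears in the evidence
    within the given relative tolerance (illustrative sketch)."""
    pattern = r"-?\d+(?:\.\d+)?"
    claim_nums = [float(n) for n in re.findall(pattern, claim_text)]
    evidence_nums = [float(n) for n in re.findall(pattern, evidence_text)]
    for c in claim_nums:
        for e in evidence_nums:
            if c == e:
                return True
            if c != 0 and abs(c - e) / abs(c) <= tolerance:
                return True
    return False

# 21.0 is within 5% of 21.3, so this matches
print(numbers_match("Revenue grew 21.3 percent", "reported growth of 21.0%"))  # True
```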
| Status | Conditions |
|---|---|
| SUPPORTED | Score >= 85 with primary source + BM25/token overlap + keyphrase or exact number match |
| PARTIAL | Score 70-84, or high score missing some signal requirements |
| UNKNOWN | Everything else (the honest default) |
CONTRADICTED is never set automatically — too risky for an automated system. Finance claims have additional guardrails requiring specific financial metric matches, not just entity name overlap.
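The status table reduces to a small decision rule. The sketch below is an illustrative restatement of that table, not the actual Veritas implementation; the parameter names are assumptions:

```python
def assign_status(score: float, has_primary_source: bool,
                  text_overlap: bool, keyphrase_or_number: bool) -> str:
    """Map a claim's evidence score and signals to a verification status,
    mirroring the thresholds in the status table (illustrative sketch)."""
    if score >= 85 and has_primary_source and text_overlap and keyphrase_or_number:
        return "SUPPORTED"
    if score >= 70:
        # Covers both 70-84 scores and high scores missing a signal requirement
        return "PARTIAL"
    return "UNKNOWN"  # the honest default
```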
29 free APIs. No API keys required (optional keys for higher rate limits on FRED, Google Fact Check, CourtListener, OpenSanctions, and OpenCorporates).
| Source | Type | Best For |
|---|---|---|
| SEC EDGAR | filing | Company financials, earnings, 10-K/10-Q/8-K filings |
| SEC Gov | gov | SEC publications, reports, and regulatory documents |
| yfinance | dataset | Real-time market data, stock prices, market cap, revenue |
| FRED | dataset | Macroeconomic indicators — GDP, CPI, unemployment, federal funds rate |
| U.S. Treasury | gov | Federal debt, revenue, spending, fiscal data |
| BLS | gov | Labor statistics — employment, wages, CPI, PPI |
| CBO | gov | Congressional Budget Office reports and projections |
| USASpending | gov | Federal government spending and contract awards |
| Census | gov | Population, demographics, housing, income statistics (ACS 2023) |
| World Bank | dataset | International development indicators across 200+ countries |
| OpenFDA | gov | Drug safety, adverse events, device recalls |
| PatentsView | dataset | USPTO patent and invention data |
| Crossref | paper | Academic papers across all fields (DOI-linked) |
| arXiv | paper | AI/ML, physics, mathematics, computer science preprints |
| PubMed | paper | Biomedical and health research (PMID-linked) |
| Semantic Scholar | paper | AI-curated academic search across all disciplines |
| Wikipedia | secondary | Named entity context, background reference |
| Wikidata | dataset | Structured knowledge base — entities, relationships, facts |
| DuckDuckGo | search | General web search fallback for uncategorized claims |
| Google Fact Check | factcheck | Verified fact-checks from PolitiFact, Snopes, Full Fact, AFP, Reuters, and IFCN-certified publishers |
| Congress.gov | gov | Congressional bills, legislation, and legislative activity |
| GovInfo | gov | U.S. Government Publishing Office — federal documents |
| FEC | gov | Federal Election Commission — campaign finance data |
| OpenStates | gov | State legislature bills, votes, and legislator data |
| WHO GHO | dataset | World Health Organization Global Health Observatory indicators |
| CourtListener | gov | Court opinions and RECAP dockets for legal claims |
| OpenSanctions | dataset | Sanctions and politically exposed persons (PEP) entity matching |
| OpenCorporates | dataset | Corporate registry data across jurisdictions |
| Local Datasets | dataset | Curated CSV datasets (FRED historical, corporate financials) for offline matching |
Smart routing uses 15 content-aware signals to optimize source ordering per claim:
- Company mentions boost yfinance + SEC EDGAR
- Academic language boosts arXiv + Crossref + Semantic Scholar
- Health/clinical terms boost PubMed + OpenFDA + WHO GHO
- Financial metrics boost yfinance + SEC EDGAR + FRED + Treasury
- Drug/pharmaceutical terms boost OpenFDA
- Labor/employment terms boost BLS
- Budget/spending terms boost CBO + USASpending + Treasury
- Demographics terms boost Census
- International indicators boost World Bank
- Patent/invention terms boost PatentsView
- Legislative/political terms boost Congress.gov + GovInfo + FEC + OpenStates
- Legal terms boost Congress.gov + GovInfo + SEC Gov + CourtListener
- Sanctions/PEP terms boost OpenSanctions
- Corporate registry terms boost OpenCorporates
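The boost pattern above amounts to keyword-triggered reordering of a base source list. A minimal sketch follows; the keyword sets, source names, and function signature are illustrative assumptions, not Veritas's actual routing tables:

```python
# Hypothetical signal tables: keyword set -> sources to boost
SIGNALS = {
    "academic": ({"study", "paper", "researchers", "published"},
                 ["arxiv", "crossref", "semantic_scholar"]),
    "labor": ({"unemployment", "wages", "jobs", "payroll"}, ["bls"]),
    "finance": ({"revenue", "earnings", "stock", "dividend"},
                ["yfinance", "sec_edgar"]),
}

def rank_sources(claim: str, base_order: list[str]) -> list[str]:
    """Move sources matched by a keyword signal to the front,
    preserving the base order for everything else."""
    words = set(claim.lower().split())
    boosted: list[str] = []
    for keywords, sources in SIGNALS.values():
        if words & keywords:
            boosted.extend(s for s in sources if s not in boosted)
    rest = [s for s in base_order if s not in boosted]
    return [s for s in boosted if s in base_order] + rest

order = rank_sources("Unemployment fell to 3.9 percent",
                     ["wikipedia", "bls", "arxiv"])
print(order)  # ['bls', 'wikipedia', 'arxiv']
```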
When a primary search query returns no results, Veritas automatically generates alternative queries using three zero-LLM transforms:
- Entity-focused — extracts proper nouns, numbers, and key terms
- Synonym swap — replaces verbs with synonyms (e.g., "increased" → "grew")
- Nominalization — converts verbs to noun forms (e.g., "increased" → "increase in")
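The three transforms can be sketched without any LLM, purely from lookup tables and token filters. The word lists and function below are illustrative, not the actual `query_variants.py` contents:

```python
# Illustrative synonym and nominalization tables
SYNONYMS = {"increased": "grew", "decreased": "fell", "announced": "reported"}
NOMINALIZE = {"increased": "increase in", "decreased": "decrease in"}

def query_variants(claim: str) -> list[str]:
    """Generate fallback queries: entity-focused, synonym swap, nominalization."""
    words = claim.split()
    variants = []
    # Entity-focused: keep capitalized tokens and tokens containing digits
    focus = [w for w in words if w[:1].isupper() or any(c.isdigit() for c in w)]
    if focus:
        variants.append(" ".join(focus))
    # Synonym swap: replace verbs with synonyms
    swapped = [SYNONYMS.get(w.lower(), w) for w in words]
    if swapped != words:
        variants.append(" ".join(swapped))
    # Nominalization: convert verbs to noun forms
    nominal = [NOMINALIZE.get(w.lower(), w) for w in words]
    if nominal != words:
        variants.append(" ".join(nominal))
    return variants

print(query_variants("Apple increased revenue 12% in 2023"))
# ['Apple 12% 2023', 'Apple grew revenue 12% in 2023', 'Apple increase in revenue 12% in 2023']
```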
Veritas ingests content from multiple sources with intelligent routing:
| Intake Path | Description |
|---|---|
| YouTube URL | Auto-detected → fast caption extraction (no GPU needed) |
| YouTube audio | Falls back to GPU transcription when captions unavailable |
| Web URL | Article text extracted via trafilatura (3-stage cascade) with metadata extraction |
| PDF URL | Auto-detected → downloaded and text extracted |
| Plain text | .txt files — read and segmented automatically |
| PDF file | Requires PyMuPDF or pdfplumber — text extracted and segmented |
| Raw text | Inline text string via CLI or API |
| RSS/Atom feeds | Parses feeds and creates sources from entries |
| SRT/VTT | Subtitle/caption file parsing |
- URL-type routing — YouTube, PDF, and HTML URLs are auto-detected and routed to the appropriate extractor
- trafilatura extraction — 3-stage cascade (custom XPath → JusText/Readability → baseline) for high-quality article text
- HTML metadata extraction — author, published date, site name, and keywords extracted from `<meta>` tags
- Image alt-text extraction — `alt` attributes and `<figcaption>` content appended as supplementary text for claim extraction
- Content quality gate — paywall messages, 404 pages, and cookie consent text are rejected before processing
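A content quality gate of this kind boils down to a length floor plus a marker blacklist. The rejection phrases below are examples, not Veritas's actual filter list:

```python
# Illustrative rejection markers for paywall/error/consent pages
REJECT_MARKERS = [
    "subscribe to continue reading",
    "we use cookies",
    "404 not found",
    "page not found",
]

def passes_quality_gate(text: str, min_chars: int = 200) -> bool:
    """Reject extracted text that is too short or matches boilerplate markers."""
    if len(text) < min_chars:
        return False
    lowered = text.lower()
    return not any(marker in lowered for marker in REJECT_MARKERS)
```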
Veritas clusters related claims across sources to find consensus and contradiction:
- Fingerprinting — each claim is tokenized and numeric values extracted for comparison
- Blocking — claims are grouped by category + shared numbers AND category + shared entities to reduce comparison space
- Clustering — NetworkX Louvain community detection groups related claims (falls back to Union-Find if NetworkX unavailable)
- PageRank centrality — selects the most representative claim in each cluster
- Modularity scoring — measures cluster quality
- Cross-source only — clusters require claims from different sources (same-source repeats are filtered)
- Consensus scoring — analyzes agreement across evidence sources using inverse normalized standard deviation. Returns strong/moderate/weak/insufficient status
- D3 export — `graph_to_json()` exports clusters as D3-compatible JSON for frontend visualization
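Consensus scoring via "inverse normalized standard deviation" can be sketched directly: the tighter the numeric values agree across sources, the higher the agreement. The thresholds and function name below are illustrative assumptions:

```python
import statistics

def consensus(values: list[float]) -> str:
    """Score agreement across evidence values as the inverse of the
    normalized standard deviation (illustrative thresholds)."""
    if len(values) < 2:
        return "insufficient"
    mean = statistics.fmean(values)
    if mean == 0:
        return "insufficient"
    spread = statistics.pstdev(values) / abs(mean)  # normalized std deviation
    agreement = 1.0 / (1.0 + spread)                # 1.0 = perfect agreement
    if agreement >= 0.9:
        return "strong"
    if agreement >= 0.75:
        return "moderate"
    return "weak"

print(consensus([3.1, 3.0, 3.2]))  # strong
print(consensus([1.0, 10.0]))      # weak
```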
Veritas tracks how claims move across sources:
- Spread Analysis — identifies the same claim appearing across multiple videos or channels, scored by global content hash and fuzzy matching
- Timeline Tracking — maps when claims first appear and how they propagate
- Top Claims Ranking — surfaces the most-repeated claims across your entire corpus, ranked by cross-source frequency
- Contradiction Detection — planned: will flag cases where sources make conflicting factual assertions about the same topic
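The hash-plus-fuzzy matching used for spread analysis can be sketched with the standard library. This is an illustrative version; the normalization rules are assumptions, though the 0.78 SequenceMatcher threshold matches the dedup layer described later:

```python
import hashlib
from difflib import SequenceMatcher

def claim_hash(text: str) -> str:
    """Global content hash: normalize case and whitespace, then SHA256."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def same_claim(a: str, b: str, threshold: float = 0.78) -> bool:
    """Exact hash match first; fall back to fuzzy similarity."""
    if claim_hash(a) == claim_hash(b):
        return True
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
```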
RESTful API wrapping the full pipeline. Supports SSE for real-time progress updates during long evidence discovery runs.
```bash
uvicorn veritas.api:app --host 0.0.0.0 --port 8000
```

Endpoints: submit (YouTube URL, text, file upload), sources, claims, evidence, clusters, search, stats, Quick Check (single claim), and ClaimReview schema.org export.
Dark command-center aesthetic with 3 screens:
- Home — submit (YouTube URL, text, file upload) + dashboard stats
- Results — claims, evidence, clusters, and search in a consolidated view
- Vault — saved and verified claims feed
```bash
cd web && npm install && npm run dev
```

MV3 browser extension for inline fact-checking:
- Right-click context menu + keyboard shortcut (Ctrl+Shift+V)
- Shadow DOM panel isolation
- Result caching (chrome.storage.local, 24h TTL)
- Quick links to Snopes, Wikipedia, Google Scholar
- ARIA accessibility labels
Generates Schema.org ClaimReview JSON-LD markup that publishers can embed in their pages. When embedded, claims appear in Google search results as fact-checks.
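A minimal ClaimReview JSON-LD document looks like the sketch below. The `@context`/`@type` fields follow the public Schema.org vocabulary; the field values and helper name are illustrative, not the output of Veritas's `claimreview.py`:

```python
import json

def claimreview_jsonld(claim: str, review_url: str, rating: str) -> str:
    """Build a minimal Schema.org ClaimReview JSON-LD document (sketch)."""
    doc = {
        "@context": "https://schema.org",
        "@type": "ClaimReview",
        "url": review_url,
        "claimReviewed": claim,
        "reviewRating": {
            "@type": "Rating",
            "alternateName": rating,  # e.g. "Supported", "Unknown"
        },
        "author": {"@type": "Organization", "name": "Example Publisher"},
    }
    return json.dumps(doc, indent=2)
```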
| Component | Technology |
|---|---|
| Language | Python 3.12+ |
| Transcription | faster-whisper (CTranslate2, CUDA-accelerated) |
| Audio download | yt-dlp |
| Database | SQLite (single-file, zero-config) |
| API server | FastAPI + SSE |
| Web frontend | React + Vite + Tailwind CSS |
| CLI | Click + Rich |
| NLP | Rule-based + optional spaCy (en_core_web_sm) for enhanced sentence splitting, NER, subject detection |
| Text extraction | trafilatura (HTML article extraction) |
| Scoring | BM25Okapi + rule-based (0-100 scale) |
| HTTP client | requests (29 free sources) |
| Caching | SQLite-backed API response cache (per-source TTL) |
| GPU support | NVIDIA CUDA 12 via pip (nvidia-cublas-cu12, nvidia-cudnn-cu12) |
| Testing | pytest (641 tests) |
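The SQLite-backed API response cache with per-source TTL can be sketched in a few lines. The schema and class name below are illustrative assumptions, not Veritas's actual cache implementation:

```python
import sqlite3
import time

class ResponseCache:
    """SQLite-backed API response cache with per-entry expiry (sketch)."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(source TEXT, query TEXT, body TEXT, expires REAL, "
            "PRIMARY KEY (source, query))")

    def get(self, source: str, query: str):
        """Return the cached body, or None if missing or expired."""
        row = self.db.execute(
            "SELECT body, expires FROM cache WHERE source=? AND query=?",
            (source, query)).fetchone()
        if row and row[1] > time.time():
            return row[0]
        return None

    def put(self, source: str, query: str, body: str, ttl: float):
        """Store a response with a per-source TTL in seconds."""
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
            (source, query, body, time.time() + ttl))
        self.db.commit()
```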
```bash
# Clone and set up
git clone https://github.com/Obelus-Labs-LLC/Veritas.git
cd Veritas
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # Linux/Mac

# Install
pip install -e .

# (Optional) Install CUDA-enabled PyTorch for GPU transcription
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# (Optional) Install spaCy for enhanced NLP (better sentence splitting, NER, subject detection)
pip install "veritas[nlp]"
python -m spacy download en_core_web_sm

# Run the pipeline on a YouTube video (uses fast caption extraction)
python -m veritas ingest "https://www.youtube.com/watch?v=EXAMPLE"
python -m veritas claims <source_id>
python -m veritas assist <source_id>

# Or ingest text/PDF directly (no audio, no transcription)
python -m veritas ingest-text path/to/document.pdf
python -m veritas claims <source_id>
python -m veritas assist <source_id>

# Or ingest a web article (auto-extracts article text + metadata)
python -m veritas ingest-url "https://example.com/article"
python -m veritas claims <source_id>
python -m veritas assist <source_id>

# Build the knowledge graph
python -m veritas build-graph

# Check results
python -m veritas sources
python -m veritas queue
python -m veritas inspect-verified --status supported --verbose
python -m veritas clusters
python -m veritas export <source_id> --format md

# Run the web frontend
uvicorn veritas.api:app --port 8000   # API server
cd web && npm run dev                 # React frontend on :5173
```

| Command | Description |
|---|---|
| `python -m veritas ingest <url>` | Download audio, save metadata, register source |
| `python -m veritas ingest-text <path>` | Ingest a text or PDF file directly (no audio) |
| `python -m veritas ingest-url <url>` | Ingest a web article URL (auto-detects YouTube/PDF/HTML) |
| `python -m veritas transcribe <id>` | Transcribe audio with faster-whisper (GPU) |
| `python -m veritas claims <id>` | Extract candidate claims (deterministic, rule-based) |
| `python -m veritas assist <id>` | Auto-discover evidence from 29 free APIs |
| `python -m veritas build-graph` | Build the knowledge graph (fingerprint, cluster, consensus) |
| `python -m veritas clusters` | Show top claim clusters from the knowledge graph |
| `python -m veritas cluster <id>` | Show detailed view of a single cluster and its members |
| `python -m veritas queue` | Show claims needing review, sorted by priority |
| `python -m veritas review <id>` | Interactively review and verify claims |
| `python -m veritas verify <claim_id>` | Set status and attach evidence for a single claim |
| `python -m veritas inspect-verified` | Inspect auto-verified claims with evidence and signals |
| `python -m veritas export <id>` | Generate Markdown or JSON brief with provenance labels |
| `python -m veritas search "<query>"` | Full-text search across all claims |
| `python -m veritas sources` | List all ingested sources with verification metrics |
| `python -m veritas spread <claim_hash>` | Show where a claim appears across sources |
| `python -m veritas timeline <claim_hash>` | Chronological propagation of a claim |
| `python -m veritas top-claims` | Most-repeated claims across all sources |
| `python -m veritas doctor` | Check environment, GPU, and dependency status |
Veritas uses deterministic rules — no AI, no API calls — to identify checkable statements:
- Segment Stitching — Whisper outputs segments with arbitrary boundaries. Veritas merges adjacent segments into windows so complete sentences can be recovered across boundaries.
- Sentence Splitting — The stitched window is split at punctuation boundaries. Fragments shorter than 7 words or 40 characters are rejected. Claims are capped at 240 characters.
- Candidate Detection — A sentence becomes a candidate claim if it contains a signal (numbers, dates, named entities, or assertion verbs) AND has a subject-like anchor (proper noun, pronoun, or number).
- Fragment Filtering — Dangling clauses starting with conjunctions are rejected. YouTube and podcast boilerplate is filtered out (sponsor messages, review requests, self-references).
- Classification — Each claim gets confidence language (hedged/definitive/unknown), a category (12 categories), and a pipe-delimited signal log showing exactly which rules fired.
- Deduplication — Two-layer dedup: SHA256 hash for exact matches (local and global), plus SequenceMatcher (0.78 threshold) for near-duplicates.
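The candidate-detection rule (signal AND subject-like anchor, with the length limits above) can be sketched as follows. The verb and pronoun lists are illustrative, not the actual extraction tables:

```python
import re

# Illustrative word lists; the real tables are larger
ASSERTION_VERBS = {"is", "are", "was", "were", "increased", "decreased",
                   "reported", "announced", "grew", "fell"}
PRONOUNS = {"he", "she", "they", "it", "we"}

def is_candidate_claim(sentence: str) -> bool:
    """A sentence qualifies if it passes the length gates, contains a
    signal (number or assertion verb), and has a subject-like anchor."""
    words = sentence.split()
    if len(words) < 7 or len(sentence) < 40 or len(sentence) > 240:
        return False
    has_number = bool(re.search(r"\d", sentence))
    has_verb = any(w.lower().strip(".,!?") in ASSERTION_VERBS for w in words)
    has_anchor = (has_number
                  or any(w[:1].isupper() for w in words[1:])  # proper-noun proxy
                  or words[0].lower() in PRONOUNS)
    return (has_number or has_verb) and has_anchor
```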
| Category | Routes To |
|---|---|
| finance | yfinance, SEC EDGAR, SEC Gov, FRED, Treasury, BLS, CBO, USASpending, OpenCorporates, Google Fact Check, Crossref, Wikipedia |
| health | WHO GHO, PubMed, OpenFDA, Google Fact Check, Crossref, Semantic Scholar, Wikipedia |
| science | arXiv, Semantic Scholar, Crossref, PubMed, World Bank, Wikipedia |
| tech | arXiv, Crossref, PatentsView, OpenCorporates, Google Fact Check, Wikipedia |
| politics | Congress.gov, Google Fact Check, FEC, GovInfo, OpenStates, OpenSanctions, Treasury, SEC Gov, CBO, USASpending, Wikipedia |
| military | Google Fact Check, Congress.gov, USASpending, GovInfo, OpenSanctions, Crossref, Wikipedia |
| education | Census, World Bank, Crossref, Google Fact Check, Semantic Scholar, Wikipedia |
| energy_climate | World Bank, Crossref, arXiv, Google Fact Check, Wikipedia |
| labor | BLS, FRED, Census, Google Fact Check, Crossref, Wikipedia |
| history_culture | Wikipedia, Wikidata, Crossref, Semantic Scholar, DuckDuckGo, Google Fact Check |
| legal | CourtListener, Google Fact Check, Congress.gov, GovInfo, Wikipedia, CBO, Crossref, SEC Gov |
| general | All 29 sources in fixed order |
Veritas is being developed alongside WeThePeople, a civic transparency platform. They are separate projects today with a planned integration path:
- WeThePeople collects and organizes public political content — congressional hearings, campaign speeches, policy debates
- Veritas provides the verification layer — extracting claims and checking them against primary sources
- The integration path: WeThePeople sends politician hearing clips and speech transcripts to Veritas for automated claim extraction and evidence verification, then surfaces the results to citizens
Together, they form a pipeline from raw political speech to verified, evidence-linked claims — with full transparency at every step.
```bash
pip install -e ".[dev]"
python -m pytest tests/ -q --tb=short
```

641 tests. Tests use fixture transcripts and mocked APIs — no network calls or GPU required.
```
veritas-app/
├── src/veritas/
│   ├── cli.py               # Click CLI (20 commands)
│   ├── api.py               # FastAPI service layer (REST + SSE)
│   ├── ingest.py            # Audio download (yt-dlp)
│   ├── ingest_text.py       # Text/PDF/URL ingestion (trafilatura, URL-type routing)
│   ├── ingest_captions.py   # Fast YouTube caption extraction (no GPU)
│   ├── transcript_parser.py # SRT/VTT subtitle parsing
│   ├── rss_ingest.py        # RSS/Atom feed ingestion
│   ├── transcribe.py        # Speech-to-text (faster-whisper)
│   ├── claim_extract.py     # Deterministic claim extraction (12 categories)
│   ├── assist.py            # Smart routing + evidence orchestration (15 signals)
│   ├── scoring.py           # BM25Okapi scoring + consensus (0-100)
│   ├── query_variants.py    # Zero-LLM query variant generation
│   ├── url_normalize.py     # URL normalization + evidence deduplication
│   ├── knowledge_graph.py   # NetworkX Louvain clustering, PageRank, D3 export
│   ├── verdict.py           # Template-based verdict summaries for journalists
│   ├── claimreview.py       # Schema.org ClaimReview JSON-LD export
│   ├── job_queue.py         # SQLite job queue (pool routing, graceful shutdown)
│   ├── evidence_sources/    # 29 free API integrations
│   │   ├── base.py          # Shared HTTP: rate limiting, caching, backoff
│   │   ├── crossref.py             ├── arxiv.py
│   │   ├── pubmed.py               ├── semantic_scholar.py
│   │   ├── sec_edgar.py            ├── sec_gov.py
│   │   ├── yfinance_source.py      ├── fred_source.py
│   │   ├── treasury.py             ├── wikipedia_source.py
│   │   ├── wikidata.py             ├── duckduckgo.py
│   │   ├── google_factcheck.py     ├── openfda.py
│   │   ├── bls.py                  ├── cbo.py
│   │   ├── usaspending.py          ├── census.py
│   │   ├── worldbank.py            ├── patentsview.py
│   │   ├── congress.py             ├── govinfo.py
│   │   ├── fec.py                  ├── openstates.py
│   │   ├── who_gho.py              ├── courtlistener.py
│   │   ├── opensanctions.py        ├── opencorporates.py
│   │   └── local_datasets.py
│   ├── verify.py            # Human claim verification
│   ├── db.py                # SQLite schema + migrations
│   ├── models.py            # Data models
│   ├── config.py            # Constants and paths
│   ├── paths.py             # Directory path helpers
│   ├── export.py            # Markdown/JSON brief generation
│   ├── search.py            # Full-text claim search
│   └── doctor.py            # Environment health checks
├── web/                     # React + Vite + Tailwind frontend
│   ├── src/pages/           # 3 screens (Home, Results, Vault)
│   ├── src/lib/             # API client + design tokens
│   └── src/components/      # Shared UI components
├── chrome-extension/        # MV3 browser extension
│   ├── manifest.json        # Keyboard shortcut, context menu
│   ├── background.js        # Fetch + caching + backoff
│   ├── content.js           # Shadow DOM panel
│   └── popup.html/js        # Dual-tab UI (Search + Settings)
├── wordpress-plugin/        # WordPress integration skeleton
├── tests/                   # pytest suite (641 tests)
├── scripts/                 # Batch operations
│   └── batch_assist.py      # Batch evidence discovery
├── data/                    # Local data (gitignored)
│   ├── raw/                 # Downloaded audio
│   ├── transcripts/         # Whisper output
│   ├── datasets/            # Curated CSV datasets
│   ├── cache/               # API response cache (SQLite)
│   ├── exports/             # Generated briefs
│   └── veritas.sqlite       # Claim database
└── pyproject.toml
```
- No external LLM — all extraction and scoring is deterministic
- No paid APIs — runs entirely on local compute + free public APIs
- Privacy first — nothing leaves your machine except structured API queries to public endpoints
- Unknown is honest — Veritas never fabricates confidence. No evidence means UNKNOWN
- Explainability — every claim logs which rules fired; every evidence suggestion logs its scoring breakdown
- AUTO vs HUMAN — exports clearly separate machine suggestions from human verification
- Temporal awareness — claims with dates are scored against time-relevant evidence
- URL normalization — evidence URLs are normalized before storage to prevent duplicates from tracking parameters or host aliases
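URL normalization of this kind can be sketched with the standard library: lowercase the host, drop the fragment and known tracking parameters, and strip trailing slashes. The tracking-parameter list is illustrative, not Veritas's actual `url_normalize.py`:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative tracking parameters to strip
TRACKING = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
            "utm_content", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    """Canonicalize a URL so duplicate evidence links collapse to one key."""
    parts = urlsplit(url)
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING])
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, query, ""))  # fragment always dropped

print(normalize_url("https://Example.com/a/?utm_source=x&id=7#frag"))
# https://example.com/a?id=7
```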
MIT
Built and maintained by Obelus Labs LLC.
Veritas — Latin for "truth."
If this project was useful to you, consider giving it a star — it helps others discover it.