
Veritas

A deterministic claim extraction and evidence discovery engine.

Veritas extracts claims from audio, video, text, and PDF sources using NLP — no large language models — then discovers supporting evidence from 29 free structured APIs using rule-based methods. Every claim maps to primary sources with full scoring transparency.

Zero LLM dependency. Zero hallucination risk. Built for podcasters, journalists, and researchers who need to verify what was said.


What It Does

Veritas takes any audio, video, text, or PDF source and runs it through a fully deterministic pipeline:

  1. Ingest — downloads audio via yt-dlp, reads text/PDF files, fetches web articles (trafilatura extraction), or pulls YouTube captions instantly without GPU
  2. Transcribe — GPU-accelerated speech-to-text using faster-whisper (CTranslate2 / CUDA), or direct text-to-segment conversion for document intake. YouTube captions bypass transcription entirely.
  3. Extract Claims — rule-based NLP identifies checkable factual statements from the transcript. No LLM, no prompt engineering — uses sentence boundary detection, named entity recognition, assertion verb patterns, and signal scoring. Optional spaCy enhancement for better sentence splitting, NER, and subject detection
  4. Categorize — context-aware keyword classification across 12 categories routes each claim to the most relevant evidence sources. Source metadata (title, channel) influences categorization so claims inherit context from their source
  5. Verify — smart routing with 15 content-aware signals sends claims to free, structured APIs. BM25Okapi scoring with entity matching, number matching (±5% tolerance), keyphrase alignment, temporal awareness, and evidence type weighting. Query variant fallback retries with alternative queries when primary returns nothing. Strict guardrails prevent false positives
  6. Cluster — a knowledge graph fingerprints claims, groups them by category and numeric/entity content, then clusters related claims across sources using NetworkX Louvain community detection (falls back to Union-Find if NetworkX unavailable). PageRank centrality selects cluster representatives. Cross-source clusters get consensus scoring (strong/moderate/weak/insufficient). D3-compatible JSON export for frontend visualization

The result: a structured database of claims, each linked to candidate evidence with full scoring transparency, plus a knowledge graph showing how claims relate across sources.


Verification Approach

Veritas takes a fundamentally different approach from LLM-based fact-checkers:

  • Extraction is deterministic — the same transcript always produces the same claims. No temperature, no sampling, no prompt sensitivity
  • Verification is rule-based — scoring functions use BM25Okapi textual relevance, entity matching, exact number matching (±5% tolerance), and evidence type classification. No embeddings, no semantic similarity
  • Evidence comes from primary sources — SEC filings, academic papers, government datasets, market data, fact-check organizations. Not web search, not LLM-generated summaries
  • Temporal awareness — claims with dates are matched against time-relevant evidence; stale data is penalized
  • Cross-source consensus — when multiple independent sources agree, confidence increases. Consensus scoring analyzes agreement across all evidence
  • Unknown is the default — if the evidence APIs return nothing relevant, the claim stays UNKNOWN. Veritas never guesses
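The number-matching signal above can be sketched in a few lines. This is an illustrative simplification, not Veritas's actual API — the function names are hypothetical, but the ±5% relative tolerance matches what the document describes:

```python
import re

def extract_numbers(text: str) -> list[float]:
    # Pull bare numbers from text, stripping thousands separators
    # ("1,234.5" -> 1234.5).
    return [float(m.replace(",", "")) for m in re.findall(r"\d[\d,]*\.?\d*", text)]

def numbers_match(claim_num: float, evidence_num: float, tolerance: float = 0.05) -> bool:
    # Exact match required for zero; otherwise relative difference within ±5%.
    if claim_num == 0:
        return evidence_num == 0
    return abs(claim_num - evidence_num) / abs(claim_num) <= tolerance

def number_match_score(claim: str, evidence: str) -> float:
    # Fraction of the claim's numbers found (within tolerance) in the evidence.
    claim_nums = extract_numbers(claim)
    if not claim_nums:
        return 0.0
    evidence_nums = extract_numbers(evidence)
    hits = sum(any(numbers_match(c, e) for e in evidence_nums) for c in claim_nums)
    return hits / len(claim_nums)
```

A claim like "Revenue rose 12% to 4.2 billion" would match evidence quoting "4.3 billion, up 12 percent", since 4.3 falls inside the 5% band around 4.2.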

Auto Status Guardrails

Status Conditions
SUPPORTED Score >= 85 with primary source + BM25/token overlap + keyphrase or exact number match
PARTIAL Score 70-84, or high score missing some signal requirements
UNKNOWN Everything else (the honest default)
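The table above reduces to a single decision function. This sketch collapses the individual signals into boolean inputs for illustration; the real implementation evaluates more signals than shown here:

```python
def auto_status(score: int, has_primary_source: bool, has_text_overlap: bool,
                has_keyphrase_or_number: bool) -> str:
    # SUPPORTED requires a high score AND every signal; PARTIAL covers the
    # 70-84 band or a high score missing a signal; UNKNOWN is the default.
    if score >= 85 and has_primary_source and has_text_overlap and has_keyphrase_or_number:
        return "SUPPORTED"
    if score >= 70:
        return "PARTIAL"
    return "UNKNOWN"
```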

CONTRADICTED is never set automatically — too risky for an automated system. Finance claims have additional guardrails requiring specific financial metric matches, not just entity name overlap.


Evidence Sources

29 free APIs. No API keys required (optional keys for higher rate limits on FRED, Google Fact Check, CourtListener, OpenSanctions, and OpenCorporates).

Source Type Best For
SEC EDGAR filing Company financials, earnings, 10-K/10-Q/8-K filings
SEC Gov gov SEC publications, reports, and regulatory documents
yfinance dataset Real-time market data, stock prices, market cap, revenue
FRED dataset Macroeconomic indicators — GDP, CPI, unemployment, federal funds rate
U.S. Treasury gov Federal debt, revenue, spending, fiscal data
BLS gov Labor statistics — employment, wages, CPI, PPI
CBO gov Congressional Budget Office reports and projections
USASpending gov Federal government spending and contract awards
Census gov Population, demographics, housing, income statistics (ACS 2023)
World Bank dataset International development indicators across 200+ countries
OpenFDA gov Drug safety, adverse events, device recalls
PatentsView dataset USPTO patent and invention data
Crossref paper Academic papers across all fields (DOI-linked)
arXiv paper AI/ML, physics, mathematics, computer science preprints
PubMed paper Biomedical and health research (PMID-linked)
Semantic Scholar paper AI-curated academic search across all disciplines
Wikipedia secondary Named entity context, background reference
Wikidata dataset Structured knowledge base — entities, relationships, facts
DuckDuckGo search General web search fallback for uncategorized claims
Google Fact Check factcheck Verified fact-checks from PolitiFact, Snopes, Full Fact, AFP, Reuters, and IFCN-certified publishers
Congress.gov gov Congressional bills, legislation, and legislative activity
GovInfo gov U.S. Government Publishing Office — federal documents
FEC gov Federal Election Commission — campaign finance data
OpenStates gov State legislature bills, votes, and legislator data
WHO GHO dataset World Health Organization Global Health Observatory indicators
CourtListener gov Court opinions and RECAP dockets for legal claims
OpenSanctions dataset Sanctions and politically exposed persons (PEP) entity matching
OpenCorporates dataset Corporate registry data across jurisdictions
Local Datasets dataset Curated CSV datasets (FRED historical, corporate financials) for offline matching

Smart Routing

Smart routing uses 15 content-aware signals to optimize source ordering per claim:

  • Company mentions boost yfinance + SEC EDGAR
  • Academic language boosts arXiv + Crossref + Semantic Scholar
  • Health/clinical terms boost PubMed + OpenFDA + WHO GHO
  • Financial metrics boost yfinance + SEC EDGAR + FRED + Treasury
  • Drug/pharmaceutical terms boost OpenFDA
  • Labor/employment terms boost BLS
  • Budget/spending terms boost CBO + USASpending + Treasury
  • Demographics terms boost Census
  • International indicators boost World Bank
  • Patent/invention terms boost PatentsView
  • Legislative/political terms boost Congress.gov + GovInfo + FEC + OpenStates
  • Legal terms boost Congress.gov + GovInfo + SEC Gov + CourtListener
  • Sanctions/PEP terms boost OpenSanctions
  • Corporate registry terms boost OpenCorporates
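Conceptually, each fired signal boosts a set of sources, and the base ordering is re-sorted by boost count. The mapping below is a small illustrative subset of the list above, and the function names are assumptions, not Veritas internals:

```python
# Hypothetical signal -> boosted-source mapping (subset of the real routing table).
SIGNAL_BOOSTS: dict[str, list[str]] = {
    "company_mention": ["yfinance", "sec_edgar"],
    "academic_language": ["arxiv", "crossref", "semantic_scholar"],
    "health_terms": ["pubmed", "openfda", "who_gho"],
    "labor_terms": ["bls"],
}

def route_sources(base_order: list[str], fired_signals: list[str]) -> list[str]:
    # Count how many fired signals boost each source, then stable-sort the
    # base ordering so boosted sources come first (ties keep base order).
    boost = {src: 0 for src in base_order}
    for sig in fired_signals:
        for src in SIGNAL_BOOSTS.get(sig, []):
            if src in boost:
                boost[src] += 1
    return sorted(base_order, key=lambda s: -boost[s])
```

Because the sort is stable, sources with equal boosts keep their category-default order, so routing only reshuffles where a signal actually fired.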

Query Variant Fallback

When a primary search query returns no results, Veritas automatically generates alternative queries using three zero-LLM transforms:

  • Entity-focused — extracts proper nouns, numbers, and key terms
  • Synonym swap — replaces verbs with synonyms (e.g., "increased" → "grew")
  • Nominalization — converts verbs to noun forms (e.g., "increased" → "increase in")
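The three transforms can each be implemented as a deterministic lookup or pattern pass. This sketch uses tiny illustrative tables — the real lists would be far larger — and the function names are assumptions:

```python
import re

# Hypothetical lookup tables; the real ones would cover many more verbs.
SYNONYMS = {"increased": "grew", "decreased": "fell", "announced": "reported"}
NOMINALIZATIONS = {"increased": "increase in", "decreased": "decrease in"}

def entity_focused(query: str) -> str:
    # Keep capitalized words (proper-noun proxies) and numbers; drop filler.
    keep = re.findall(r"[A-Z][\w&.-]*|\d[\d,.%]*", query)
    return " ".join(keep)

def synonym_swap(query: str) -> str:
    # Replace known verbs with synonyms, leaving everything else untouched.
    return " ".join(SYNONYMS.get(w.lower(), w) for w in query.split())

def nominalize(query: str) -> str:
    # Convert known verbs to their noun forms ("increased" -> "increase in").
    return " ".join(NOMINALIZATIONS.get(w.lower(), w) for w in query.split())
```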

Document Ingestion

Veritas ingests content from multiple sources with intelligent routing:

Intake Path Description
YouTube URL Auto-detected → fast caption extraction (no GPU needed)
YouTube audio Falls back to GPU transcription when captions unavailable
Web URL Article text extracted via trafilatura (3-stage cascade) with metadata extraction
PDF URL Auto-detected → downloaded and text extracted
Plain text .txt files — read and segmented automatically
PDF file Requires PyMuPDF or pdfplumber — text extracted and segmented
Raw text Inline text string via CLI or API
RSS/Atom feeds Parses feeds and creates sources from entries
SRT/VTT Subtitle/caption file parsing

URL Ingestion Features

  • URL-type routing — YouTube, PDF, and HTML URLs are auto-detected and routed to the appropriate extractor
  • trafilatura extraction — 3-stage cascade (custom XPath → JusText/Readability → baseline) for high-quality article text
  • HTML metadata extraction — author, published date, site name, and keywords extracted from <meta> tags
  • Image alt-text extraction — alt attributes and <figcaption> content appended as supplementary text for claim extraction
  • Content quality gate — paywall messages, 404 pages, and cookie consent text are rejected before processing
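The content quality gate boils down to rejecting pages that contain boilerplate markers or too little text. The marker list and minimum length below are illustrative assumptions, not the actual Veritas values:

```python
# Hypothetical reject markers for the quality gate; the real list is longer.
REJECT_MARKERS = [
    "subscribe to continue reading",
    "404 not found",
    "we use cookies",
    "accept all cookies",
]
MIN_LENGTH = 200  # assumed minimum useful article length, in characters

def passes_quality_gate(text: str) -> bool:
    # Reject extractions that are too short or dominated by paywall/consent
    # boilerplate before they reach claim extraction.
    if len(text) < MIN_LENGTH:
        return False
    lowered = text.lower()
    return not any(marker in lowered for marker in REJECT_MARKERS)
```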

Knowledge Graph

Veritas clusters related claims across sources to find consensus and contradiction:

  • Fingerprinting — each claim is tokenized and numeric values extracted for comparison
  • Blocking — claims are grouped by category + shared numbers AND category + shared entities to reduce comparison space
  • Clustering — NetworkX Louvain community detection groups related claims (falls back to Union-Find if NetworkX unavailable)
  • PageRank centrality — selects the most representative claim in each cluster
  • Modularity scoring — measures cluster quality
  • Cross-source only — clusters require claims from different sources (same-source repeats are filtered)
  • Consensus scoring — analyzes agreement across evidence sources using inverse normalized standard deviation. Returns strong/moderate/weak/insufficient status
  • D3 export — graph_to_json() exports clusters as D3-compatible JSON for frontend visualization
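Two of the pieces above can be sketched compactly: the Union-Find fallback used when NetworkX is unavailable, and consensus scoring via inverse normalized standard deviation. The status thresholds here are assumptions for illustration; only the overall shape follows the description:

```python
from statistics import mean, pstdev

class UnionFind:
    # Minimal union-find, the clustering fallback when NetworkX Louvain
    # is unavailable: connected components become clusters.
    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def cluster_claims(n_claims: int, edges: list[tuple[int, int]]) -> dict[int, list[int]]:
    # Group claim indices connected by similarity edges into clusters.
    uf = UnionFind(n_claims)
    for a, b in edges:
        uf.union(a, b)
    clusters: dict[int, list[int]] = {}
    for i in range(n_claims):
        clusters.setdefault(uf.find(i), []).append(i)
    return clusters

def consensus_status(values: list[float]) -> str:
    # Agreement = 1 - (population stddev / |mean|), clamped to [0, 1].
    # Thresholds below are illustrative, not the real cutoffs.
    if len(values) < 2:
        return "insufficient"
    m = mean(values)
    if m == 0:
        return "insufficient"
    agreement = 1.0 - min(pstdev(values) / abs(m), 1.0)
    if agreement >= 0.9:
        return "strong"
    if agreement >= 0.7:
        return "moderate"
    return "weak"
```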

Cross-Source Intelligence

Veritas tracks how claims move across sources:

  • Spread Analysis — identifies the same claim appearing across multiple videos or channels, scored by global content hash and fuzzy matching
  • Timeline Tracking — maps when claims first appear and how they propagate
  • Top Claims Ranking — surfaces the most-repeated claims across your entire corpus, ranked by cross-source frequency
  • Contradiction Detection — planned: will flag cases where sources make conflicting factual assertions about the same topic

Web Frontend & API

FastAPI Service Layer

RESTful API wrapping the full pipeline. Supports SSE for real-time progress updates during long evidence discovery runs.

uvicorn veritas.api:app --host 0.0.0.0 --port 8000

Endpoints: submit (YouTube URL, text, file upload), sources, claims, evidence, clusters, search, stats, Quick Check (single claim), and ClaimReview schema.org export.

React Web Frontend

Dark command-center aesthetic with 3 screens:

  • Home — submit (YouTube URL, text, file upload) + dashboard stats
  • Results — claims, evidence, clusters, and search in a consolidated view
  • Vault — saved and verified claims feed
cd web && npm install && npm run dev

Chrome Extension

MV3 browser extension for inline fact-checking:

  • Right-click context menu + keyboard shortcut (Ctrl+Shift+V)
  • Shadow DOM panel isolation
  • Result caching (chrome.storage.local, 24h TTL)
  • Quick links to Snopes, Wikipedia, Google Scholar
  • ARIA accessibility labels

ClaimReview Schema Export

Generates Schema.org ClaimReview JSON-LD markup that publishers can embed in their pages. Embedded claims become eligible to appear in Google search results as fact-checks.
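A minimal ClaimReview document looks like the sketch below. This shows only the core Schema.org fields; the actual export includes more (publisher, author, dates), and the function name is an assumption:

```python
import json

def claim_review_jsonld(claim_text: str, url: str, rating_name: str) -> str:
    # Minimal Schema.org ClaimReview markup as JSON-LD; publishers embed
    # this in a <script type="application/ld+json"> tag.
    doc = {
        "@context": "https://schema.org",
        "@type": "ClaimReview",
        "claimReviewed": claim_text,
        "url": url,
        "reviewRating": {
            "@type": "Rating",
            "alternateName": rating_name,  # e.g. "Supported", "Unknown"
        },
    }
    return json.dumps(doc, indent=2)
```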


Tech Stack

Component Technology
Language Python 3.12+
Transcription faster-whisper (CTranslate2, CUDA-accelerated)
Audio download yt-dlp
Database SQLite (single-file, zero-config)
API server FastAPI + SSE
Web frontend React + Vite + Tailwind CSS
CLI Click + Rich
NLP Rule-based + optional spaCy (en_core_web_sm) for enhanced sentence splitting, NER, subject detection
Text extraction trafilatura (HTML article extraction)
Scoring BM25Okapi + rule-based (0-100 scale)
HTTP client requests (29 free sources)
Caching SQLite-backed API response cache (per-source TTL)
GPU support NVIDIA CUDA 12 via pip (nvidia-cublas-cu12, nvidia-cudnn-cu12)
Testing pytest (641 tests)

Quick Start

# Clone and set up
git clone https://github.com/Obelus-Labs-LLC/Veritas.git
cd Veritas
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # Linux/Mac

# Install
pip install -e .

# (Optional) Install CUDA-enabled PyTorch for GPU transcription
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# (Optional) Install spaCy for enhanced NLP (better sentence splitting, NER, subject detection)
pip install "veritas[nlp]"
python -m spacy download en_core_web_sm

# Run the pipeline on a YouTube video (uses fast caption extraction)
python -m veritas ingest "https://www.youtube.com/watch?v=EXAMPLE"
python -m veritas claims <source_id>
python -m veritas assist <source_id>

# Or ingest text/PDF directly (no audio, no transcription)
python -m veritas ingest-text path/to/document.pdf
python -m veritas claims <source_id>
python -m veritas assist <source_id>

# Or ingest a web article (auto-extracts article text + metadata)
python -m veritas ingest-url "https://example.com/article"
python -m veritas claims <source_id>
python -m veritas assist <source_id>

# Build the knowledge graph
python -m veritas build-graph

# Check results
python -m veritas sources
python -m veritas queue
python -m veritas inspect-verified --status supported --verbose
python -m veritas clusters
python -m veritas export <source_id> --format md

# Run the web frontend
uvicorn veritas.api:app --port 8000  # API server
cd web && npm run dev                 # React frontend on :5173

CLI Commands

Command Description
python -m veritas ingest <url> Download audio, save metadata, register source
python -m veritas ingest-text <path> Ingest a text or PDF file directly (no audio)
python -m veritas ingest-url <url> Ingest a web article URL (auto-detects YouTube/PDF/HTML)
python -m veritas transcribe <id> Transcribe audio with faster-whisper (GPU)
python -m veritas claims <id> Extract candidate claims (deterministic, rule-based)
python -m veritas assist <id> Auto-discover evidence from 29 free APIs
python -m veritas build-graph Build the knowledge graph (fingerprint, cluster, consensus)
python -m veritas clusters Show top claim clusters from the knowledge graph
python -m veritas cluster <id> Show detailed view of a single cluster and its members
python -m veritas queue Show claims needing review, sorted by priority
python -m veritas review <id> Interactively review and verify claims
python -m veritas verify <claim_id> Set status and attach evidence for a single claim
python -m veritas inspect-verified Inspect auto-verified claims with evidence and signals
python -m veritas export <id> Generate Markdown or JSON brief with provenance labels
python -m veritas search "<query>" Full-text search across all claims
python -m veritas sources List all ingested sources with verification metrics
python -m veritas spread <claim_hash> Show where a claim appears across sources
python -m veritas timeline <claim_hash> Chronological propagation of a claim
python -m veritas top-claims Most-repeated claims across all sources
python -m veritas doctor Check environment, GPU, and dependency status

How Claim Extraction Works

Veritas uses deterministic rules — no AI, no API calls — to identify checkable statements:

  1. Segment Stitching — Whisper outputs segments with arbitrary boundaries. Veritas merges adjacent segments into windows so complete sentences can be recovered across boundaries.

  2. Sentence Splitting — The stitched window is split at punctuation boundaries. Fragments shorter than 7 words or 40 characters are rejected. Claims are capped at 240 characters.

  3. Candidate Detection — A sentence becomes a candidate claim if it contains a signal (numbers, dates, named entities, or assertion verbs) AND has a subject-like anchor (proper noun, pronoun, or number).

  4. Fragment Filtering — Dangling clauses starting with conjunctions are rejected. YouTube and podcast boilerplate is filtered out (sponsor messages, review requests, self-references).

  5. Classification — Each claim gets confidence language (hedged/definitive/unknown), a category (12 categories), and a pipe-delimited signal log showing exactly which rules fired.

  6. Deduplication — Two-layer dedup: SHA256 hash for exact matches (local and global), plus SequenceMatcher (0.78 threshold) for near-duplicates.
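The two-layer dedup in step 6 can be sketched directly from the description — SHA256 over a normalized form for exact repeats, then SequenceMatcher at the stated 0.78 threshold for near-duplicates. Function names are illustrative:

```python
import hashlib
from difflib import SequenceMatcher

def exact_hash(text: str) -> str:
    # Layer 1: SHA256 over a whitespace/case-normalized form catches
    # exact repeats regardless of trivial formatting differences.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedup_claims(claims: list[str], fuzzy_threshold: float = 0.78) -> list[str]:
    kept: list[str] = []
    seen_hashes: set[str] = set()
    for claim in claims:
        h = exact_hash(claim)
        if h in seen_hashes:
            continue
        # Layer 2: SequenceMatcher catches near-duplicates at/above threshold.
        if any(SequenceMatcher(None, claim.lower(), k.lower()).ratio() >= fuzzy_threshold
               for k in kept):
            continue
        seen_hashes.add(h)
        kept.append(claim)
    return kept
```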


Claim Categories

Category Routes To
finance yfinance, SEC EDGAR, SEC Gov, FRED, Treasury, BLS, CBO, USASpending, OpenCorporates, Google Fact Check, Crossref, Wikipedia
health WHO GHO, PubMed, OpenFDA, Google Fact Check, Crossref, Semantic Scholar, Wikipedia
science arXiv, Semantic Scholar, Crossref, PubMed, World Bank, Wikipedia
tech arXiv, Crossref, PatentsView, OpenCorporates, Google Fact Check, Wikipedia
politics Congress.gov, Google Fact Check, FEC, GovInfo, OpenStates, OpenSanctions, Treasury, SEC Gov, CBO, USASpending, Wikipedia
military Google Fact Check, Congress.gov, USASpending, GovInfo, OpenSanctions, Crossref, Wikipedia
education Census, World Bank, Crossref, Google Fact Check, Semantic Scholar, Wikipedia
energy_climate World Bank, Crossref, arXiv, Google Fact Check, Wikipedia
labor BLS, FRED, Census, Google Fact Check, Crossref, Wikipedia
history_culture Wikipedia, Wikidata, Crossref, Semantic Scholar, DuckDuckGo, Google Fact Check
legal CourtListener, Google Fact Check, Congress.gov, GovInfo, Wikipedia, CBO, Crossref, SEC Gov
general All 29 sources in fixed order

Integration: WeThePeople

Veritas is being developed alongside WeThePeople, a civic transparency platform. They are separate projects today with a planned integration path:

  • WeThePeople collects and organizes public political content — congressional hearings, campaign speeches, policy debates
  • Veritas provides the verification layer — extracting claims and checking them against primary sources
  • The integration path: WeThePeople sends politician hearing clips and speech transcripts to Veritas for automated claim extraction and evidence verification, then surfaces the results to citizens

Together, they form a pipeline from raw political speech to verified, evidence-linked claims — with full transparency at every step.


Running Tests

pip install -e ".[dev]"
python -m pytest tests/ -q --tb=short

641 tests. Tests use fixture transcripts and mocked APIs — no network calls or GPU required.


Architecture

veritas-app/
├── src/veritas/
│   ├── cli.py              # Click CLI (20 commands)
│   ├── api.py              # FastAPI service layer (REST + SSE)
│   ├── ingest.py           # Audio download (yt-dlp)
│   ├── ingest_text.py      # Text/PDF/URL ingestion (trafilatura, URL-type routing)
│   ├── ingest_captions.py  # Fast YouTube caption extraction (no GPU)
│   ├── transcript_parser.py # SRT/VTT subtitle parsing
│   ├── rss_ingest.py       # RSS/Atom feed ingestion
│   ├── transcribe.py       # Speech-to-text (faster-whisper)
│   ├── claim_extract.py    # Deterministic claim extraction (12 categories)
│   ├── assist.py           # Smart routing + evidence orchestration (15 signals)
│   ├── scoring.py          # BM25Okapi scoring + consensus (0-100)
│   ├── query_variants.py   # Zero-LLM query variant generation
│   ├── url_normalize.py    # URL normalization + evidence deduplication
│   ├── knowledge_graph.py  # NetworkX Louvain clustering, PageRank, D3 export
│   ├── verdict.py          # Template-based verdict summaries for journalists
│   ├── claimreview.py      # Schema.org ClaimReview JSON-LD export
│   ├── job_queue.py        # SQLite job queue (pool routing, graceful shutdown)
│   ├── evidence_sources/   # 29 free API integrations
│   │   ├── base.py         # Shared HTTP: rate limiting, caching, backoff
│   │   ├── crossref.py     ├── arxiv.py
│   │   ├── pubmed.py       ├── semantic_scholar.py
│   │   ├── sec_edgar.py    ├── sec_gov.py
│   │   ├── yfinance_source.py ├── fred_source.py
│   │   ├── treasury.py     ├── wikipedia_source.py
│   │   ├── wikidata.py     ├── duckduckgo.py
│   │   ├── google_factcheck.py ├── openfda.py
│   │   ├── bls.py          ├── cbo.py
│   │   ├── usaspending.py  ├── census.py
│   │   ├── worldbank.py    ├── patentsview.py
│   │   ├── congress.py     ├── govinfo.py
│   │   ├── fec.py          ├── openstates.py
│   │   ├── who_gho.py      ├── courtlistener.py
│   │   ├── opensanctions.py ├── opencorporates.py
│   │   └── local_datasets.py
│   ├── verify.py           # Human claim verification
│   ├── db.py               # SQLite schema + migrations
│   ├── models.py           # Data models
│   ├── config.py           # Constants and paths
│   ├── paths.py            # Directory path helpers
│   ├── export.py           # Markdown/JSON brief generation
│   ├── search.py           # Full-text claim search
│   └── doctor.py           # Environment health checks
├── web/                    # React + Vite + Tailwind frontend
│   ├── src/pages/          # 3 screens (Home, Results, Vault)
│   ├── src/lib/            # API client + design tokens
│   └── src/components/     # Shared UI components
├── chrome-extension/       # MV3 browser extension
│   ├── manifest.json       # Keyboard shortcut, context menu
│   ├── background.js       # Fetch + caching + backoff
│   ├── content.js          # Shadow DOM panel
│   └── popup.html/js       # Dual-tab UI (Search + Settings)
├── wordpress-plugin/       # WordPress integration skeleton
├── tests/                  # pytest suite (641 tests)
├── scripts/                # Batch operations
│   └── batch_assist.py     # Batch evidence discovery
├── data/                   # Local data (gitignored)
│   ├── raw/                # Downloaded audio
│   ├── transcripts/        # Whisper output
│   ├── datasets/           # Curated CSV datasets
│   ├── cache/              # API response cache (SQLite)
│   ├── exports/            # Generated briefs
│   └── veritas.sqlite      # Claim database
└── pyproject.toml

Design Principles

  • No external LLM — all extraction and scoring is deterministic
  • No paid APIs — runs entirely on local compute + free public APIs
  • Privacy first — nothing leaves your machine except structured API queries to public endpoints
  • Unknown is honest — Veritas never fabricates confidence. No evidence means UNKNOWN
  • Explainability — every claim logs which rules fired; every evidence suggestion logs its scoring breakdown
  • AUTO vs HUMAN — exports clearly separate machine suggestions from human verification
  • Temporal awareness — claims with dates are scored against time-relevant evidence
  • URL normalization — evidence URLs are normalized before storage to prevent duplicates from tracking parameters or host aliases
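The URL normalization principle can be sketched with the standard library. The tracking-parameter list here is a small illustrative subset, and the exact normalization rules (e.g. trailing-slash handling) are assumptions:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Hypothetical tracking parameters to strip; the real list would be longer.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    # Lowercase the host, drop tracking params and fragments, and strip
    # trailing slashes so equivalent evidence URLs collapse to one key.
    parts = urlparse(url)
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING_PARAMS])
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/"),
        "",      # params
        query,
        "",      # fragment
    ))
```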

License

MIT


Built and maintained by Obelus Labs LLC.

Veritas — Latin for "truth."


If this project was useful to you, consider giving it a star — it helps others discover it.
