
Veritas

A deterministic claim extraction and evidence discovery engine.

Veritas extracts claims from audio, video, text, and PDF sources using NLP — no large language models — then discovers supporting evidence from 29 free structured APIs using rule-based methods. Every claim maps to primary sources with full scoring transparency.

Zero LLM dependency. Zero hallucination risk. Built for podcasters, journalists, and researchers who need to verify what was said.


What It Does

Veritas takes any audio, video, text, or PDF source and runs it through a fully deterministic pipeline:

  1. Ingest — downloads audio via yt-dlp, reads text/PDF files, fetches web articles (trafilatura extraction), or pulls YouTube captions instantly without GPU
  2. Transcribe — GPU-accelerated speech-to-text using faster-whisper (CTranslate2 / CUDA), or direct text-to-segment conversion for document intake. YouTube captions bypass transcription entirely.
  3. Extract Claims — rule-based NLP identifies checkable factual statements from the transcript. No LLM, no prompt engineering — uses sentence boundary detection, named entity recognition, assertion verb patterns, and signal scoring. Optional spaCy enhancement for better sentence splitting, NER, and subject detection
  4. Categorize — context-aware keyword classification across 12 categories routes each claim to the most relevant evidence sources. Source metadata (title, channel) influences categorization so claims inherit context from their source
  5. Verify — smart routing with 15 content-aware signals sends claims to free, structured APIs. BM25Okapi scoring with entity matching, number matching (±5% tolerance), keyphrase alignment, temporal awareness, and evidence type weighting. Query variant fallback retries with alternative queries when primary returns nothing. Strict guardrails prevent false positives
  6. Cluster — a knowledge graph fingerprints claims, groups them by category and numeric/entity content, then clusters related claims across sources using NetworkX Louvain community detection (falls back to Union-Find if NetworkX unavailable). PageRank centrality selects cluster representatives. Cross-source clusters get consensus scoring (strong/moderate/weak/insufficient). D3-compatible JSON export for frontend visualization

The result: a structured database of claims, each linked to candidate evidence with full scoring transparency, plus a knowledge graph showing how claims relate across sources.


Verification Approach

Veritas takes a fundamentally different approach from LLM-based fact-checkers:

  • Extraction is deterministic — the same transcript always produces the same claims. No temperature, no sampling, no prompt sensitivity
  • Verification is rule-based — scoring functions use BM25Okapi textual relevance, entity matching, exact number matching (±5% tolerance), and evidence type classification. No embeddings, no semantic similarity
  • Evidence comes from primary sources — SEC filings, academic papers, government datasets, market data, fact-check organizations. Not web search, not LLM-generated summaries
  • Temporal awareness — claims with dates are matched against time-relevant evidence; stale data is penalized
  • Cross-source consensus — when multiple independent sources agree, confidence increases. Consensus scoring analyzes agreement across all evidence
  • Unknown is the default — if the evidence APIs return nothing relevant, the claim stays UNKNOWN. Veritas never guesses
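The number-matching signal above can be sketched in a few lines. This is an illustrative simplification, not Veritas's actual API — the function names are hypothetical, but the ±5% relative tolerance matches what the document describes:

```python
import re

def extract_numbers(text: str) -> list[float]:
    # Pull bare numbers from text, stripping thousands separators
    # ("1,234.5" -> 1234.5).
    return [float(m.replace(",", "")) for m in re.findall(r"\d[\d,]*\.?\d*", text)]

def numbers_match(claim_num: float, evidence_num: float, tolerance: float = 0.05) -> bool:
    # Exact match required for zero; otherwise relative difference within ±5%.
    if claim_num == 0:
        return evidence_num == 0
    return abs(claim_num - evidence_num) / abs(claim_num) <= tolerance

def number_match_score(claim: str, evidence: str) -> float:
    # Fraction of the claim's numbers found (within tolerance) in the evidence.
    claim_nums = extract_numbers(claim)
    if not claim_nums:
        return 0.0
    evidence_nums = extract_numbers(evidence)
    hits = sum(any(numbers_match(c, e) for e in evidence_nums) for c in claim_nums)
    return hits / len(claim_nums)
```

A claim like "Revenue rose 12% to 4.2 billion" would match evidence quoting "4.3 billion, up 12 percent", since 4.3 falls inside the 5% band around 4.2.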

Auto Status Guardrails

Status Conditions
SUPPORTED Score >= 85 with primary source + BM25/token overlap + keyphrase or exact number match
PARTIAL Score 70-84, or high score missing some signal requirements
UNKNOWN Everything else (the honest default)
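The table above reduces to a single decision function. This sketch collapses the individual signals into boolean inputs for illustration; the real implementation evaluates more signals than shown here:

```python
def auto_status(score: int, has_primary_source: bool, has_text_overlap: bool,
                has_keyphrase_or_number: bool) -> str:
    # SUPPORTED requires a high score AND every signal; PARTIAL covers the
    # 70-84 band or a high score missing a signal; UNKNOWN is the default.
    if score >= 85 and has_primary_source and has_text_overlap and has_keyphrase_or_number:
        return "SUPPORTED"
    if score >= 70:
        return "PARTIAL"
    return "UNKNOWN"
```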

CONTRADICTED is never set automatically — too risky for an automated system. Finance claims have additional guardrails requiring specific financial metric matches, not just entity name overlap.


Evidence Sources

29 free APIs. No API keys required (optional keys for higher rate limits on FRED, Google Fact Check, CourtListener, OpenSanctions, and OpenCorporates).

Source Type Best For
SEC EDGAR filing Company financials, earnings, 10-K/10-Q/8-K filings
SEC Gov gov SEC publications, reports, and regulatory documents
yfinance dataset Real-time market data, stock prices, market cap, revenue
FRED dataset Macroeconomic indicators — GDP, CPI, unemployment, federal funds rate
U.S. Treasury gov Federal debt, revenue, spending, fiscal data
BLS gov Labor statistics — employment, wages, CPI, PPI
CBO gov Congressional Budget Office reports and projections
USASpending gov Federal government spending and contract awards
Census gov Population, demographics, housing, income statistics (ACS 2023)
World Bank dataset International development indicators across 200+ countries
OpenFDA gov Drug safety, adverse events, device recalls
PatentsView dataset USPTO patent and invention data
Crossref paper Academic papers across all fields (DOI-linked)
arXiv paper AI/ML, physics, mathematics, computer science preprints
PubMed paper Biomedical and health research (PMID-linked)
Semantic Scholar paper AI-curated academic search across all disciplines
Wikipedia secondary Named entity context, background reference
Wikidata dataset Structured knowledge base — entities, relationships, facts
DuckDuckGo search General web search fallback for uncategorized claims
Google Fact Check factcheck Verified fact-checks from PolitiFact, Snopes, Full Fact, AFP, Reuters, and IFCN-certified publishers
Congress.gov gov Congressional bills, legislation, and legislative activity
GovInfo gov U.S. Government Publishing Office — federal documents
FEC gov Federal Election Commission — campaign finance data
OpenStates gov State legislature bills, votes, and legislator data
WHO GHO dataset World Health Organization Global Health Observatory indicators
CourtListener gov Court opinions and RECAP dockets for legal claims
OpenSanctions dataset Sanctions and politically exposed persons (PEP) entity matching
OpenCorporates dataset Corporate registry data across jurisdictions
Local Datasets dataset Curated CSV datasets (FRED historical, corporate financials) for offline matching

Smart Routing

Smart routing uses 15 content-aware signals to optimize source ordering per claim:

  • Company mentions boost yfinance + SEC EDGAR
  • Academic language boosts arXiv + Crossref + Semantic Scholar
  • Health/clinical terms boost PubMed + OpenFDA + WHO GHO
  • Financial metrics boost yfinance + SEC EDGAR + FRED + Treasury
  • Drug/pharmaceutical terms boost OpenFDA
  • Labor/employment terms boost BLS
  • Budget/spending terms boost CBO + USASpending + Treasury
  • Demographics terms boost Census
  • International indicators boost World Bank
  • Patent/invention terms boost PatentsView
  • Legislative/political terms boost Congress.gov + GovInfo + FEC + OpenStates
  • Legal terms boost Congress.gov + GovInfo + SEC Gov + CourtListener
  • Sanctions/PEP terms boost OpenSanctions
  • Corporate registry terms boost OpenCorporates
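Conceptually, each fired signal boosts a set of sources, and the base ordering is re-sorted by boost count. The mapping below is a small illustrative subset of the list above, and the function names are assumptions, not Veritas internals:

```python
# Hypothetical signal -> boosted-source mapping (subset of the real routing table).
SIGNAL_BOOSTS: dict[str, list[str]] = {
    "company_mention": ["yfinance", "sec_edgar"],
    "academic_language": ["arxiv", "crossref", "semantic_scholar"],
    "health_terms": ["pubmed", "openfda", "who_gho"],
    "labor_terms": ["bls"],
}

def route_sources(base_order: list[str], fired_signals: list[str]) -> list[str]:
    # Count how many fired signals boost each source, then stable-sort the
    # base ordering so boosted sources come first (ties keep base order).
    boost = {src: 0 for src in base_order}
    for sig in fired_signals:
        for src in SIGNAL_BOOSTS.get(sig, []):
            if src in boost:
                boost[src] += 1
    return sorted(base_order, key=lambda s: -boost[s])
```

Because the sort is stable, sources with equal boosts keep their category-default order, so routing only reshuffles where a signal actually fired.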

Query Variant Fallback

When a primary search query returns no results, Veritas automatically generates alternative queries using three zero-LLM transforms:

  • Entity-focused — extracts proper nouns, numbers, and key terms
  • Synonym swap — replaces verbs with synonyms (e.g., "increased" → "grew")
  • Nominalization — converts verbs to noun forms (e.g., "increased" → "increase in")
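The three transforms can each be implemented as a deterministic lookup or pattern pass. This sketch uses tiny illustrative tables — the real lists would be far larger — and the function names are assumptions:

```python
import re

# Hypothetical lookup tables; the real ones would cover many more verbs.
SYNONYMS = {"increased": "grew", "decreased": "fell", "announced": "reported"}
NOMINALIZATIONS = {"increased": "increase in", "decreased": "decrease in"}

def entity_focused(query: str) -> str:
    # Keep capitalized words (proper-noun proxies) and numbers; drop filler.
    keep = re.findall(r"[A-Z][\w&.-]*|\d[\d,.%]*", query)
    return " ".join(keep)

def synonym_swap(query: str) -> str:
    # Replace known verbs with synonyms, leaving everything else untouched.
    return " ".join(SYNONYMS.get(w.lower(), w) for w in query.split())

def nominalize(query: str) -> str:
    # Convert known verbs to their noun forms ("increased" -> "increase in").
    return " ".join(NOMINALIZATIONS.get(w.lower(), w) for w in query.split())
```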

Document Ingestion

Veritas ingests content from multiple sources with intelligent routing:

Intake Path Description
YouTube URL Auto-detected → fast caption extraction (no GPU needed)
YouTube audio Falls back to GPU transcription when captions unavailable
Web URL Article text extracted via trafilatura (3-stage cascade) with metadata extraction
PDF URL Auto-detected → downloaded and text extracted
Plain text .txt files — read and segmented automatically
PDF file Requires PyMuPDF or pdfplumber — text extracted and segmented
Raw text Inline text string via CLI or API
RSS/Atom feeds Parses feeds and creates sources from entries
SRT/VTT Subtitle/caption file parsing

URL Ingestion Features

  • URL-type routing — YouTube, PDF, and HTML URLs are auto-detected and routed to the appropriate extractor
  • trafilatura extraction — 3-stage cascade (custom XPath → JusText/Readability → baseline) for high-quality article text
  • HTML metadata extraction — author, published date, site name, and keywords extracted from <meta> tags
  • Image alt-text extraction — alt attributes and <figcaption> content appended as supplementary text for claim extraction
  • Content quality gate — paywall messages, 404 pages, and cookie consent text are rejected before processing
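The content quality gate boils down to rejecting pages that contain boilerplate markers or too little text. The marker list and minimum length below are illustrative assumptions, not the actual Veritas values:

```python
# Hypothetical reject markers for the quality gate; the real list is longer.
REJECT_MARKERS = [
    "subscribe to continue reading",
    "404 not found",
    "we use cookies",
    "accept all cookies",
]
MIN_LENGTH = 200  # assumed minimum useful article length, in characters

def passes_quality_gate(text: str) -> bool:
    # Reject extractions that are too short or dominated by paywall/consent
    # boilerplate before they reach claim extraction.
    if len(text) < MIN_LENGTH:
        return False
    lowered = text.lower()
    return not any(marker in lowered for marker in REJECT_MARKERS)
```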

Knowledge Graph

Veritas clusters related claims across sources to find consensus and contradiction:

  • Fingerprinting — each claim is tokenized and numeric values extracted for comparison
  • Blocking — claims are grouped by category + shared numbers AND category + shared entities to reduce comparison space
  • Clustering — NetworkX Louvain community detection groups related claims (falls back to Union-Find if NetworkX unavailable)
  • PageRank centrality — selects the most representative claim in each cluster
  • Modularity scoring — measures cluster quality
  • Cross-source only — clusters require claims from different sources (same-source repeats are filtered)
  • Consensus scoring — analyzes agreement across evidence sources using inverse normalized standard deviation. Returns strong/moderate/weak/insufficient status
  • D3 export — graph_to_json() exports clusters as D3-compatible JSON for frontend visualization
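Two of the pieces above can be sketched compactly: the Union-Find fallback used when NetworkX is unavailable, and consensus scoring via inverse normalized standard deviation. The status thresholds here are assumptions for illustration; only the overall shape follows the description:

```python
from statistics import mean, pstdev

class UnionFind:
    # Minimal union-find, the clustering fallback when NetworkX Louvain
    # is unavailable: connected components become clusters.
    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def cluster_claims(n_claims: int, edges: list[tuple[int, int]]) -> dict[int, list[int]]:
    # Group claim indices connected by similarity edges into clusters.
    uf = UnionFind(n_claims)
    for a, b in edges:
        uf.union(a, b)
    clusters: dict[int, list[int]] = {}
    for i in range(n_claims):
        clusters.setdefault(uf.find(i), []).append(i)
    return clusters

def consensus_status(values: list[float]) -> str:
    # Agreement = 1 - (population stddev / |mean|), clamped to [0, 1].
    # Thresholds below are illustrative, not the real cutoffs.
    if len(values) < 2:
        return "insufficient"
    m = mean(values)
    if m == 0:
        return "insufficient"
    agreement = 1.0 - min(pstdev(values) / abs(m), 1.0)
    if agreement >= 0.9:
        return "strong"
    if agreement >= 0.7:
        return "moderate"
    return "weak"
```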

Cross-Source Intelligence

Veritas tracks how claims move across sources:

  • Spread Analysis — identifies the same claim appearing across multiple videos or channels, scored by global content hash and fuzzy matching
  • Timeline Tracking — maps when claims first appear and how they propagate
  • Top Claims Ranking — surfaces the most-repeated claims across your entire corpus, ranked by cross-source frequency
  • Contradiction Detection — planned: will flag cases where sources make conflicting factual assertions about the same topic

Web Frontend & API

FastAPI Service Layer

RESTful API wrapping the full pipeline. Supports SSE for real-time progress updates during long evidence discovery runs.

uvicorn veritas.api:app --host 0.0.0.0 --port 8000

Endpoints: submit (YouTube URL, text, file upload), sources, claims, evidence, clusters, search, stats, Quick Check (single claim), and ClaimReview schema.org export.

React Web Frontend

Dark command-center aesthetic with 3 screens:

  • Home — submit (YouTube URL, text, file upload) + dashboard stats
  • Results — claims, evidence, clusters, and search in a consolidated view
  • Vault — saved and verified claims feed
cd web && npm install && npm run dev

Chrome Extension

MV3 browser extension for inline fact-checking:

  • Right-click context menu + keyboard shortcut (Ctrl+Shift+V)
  • Shadow DOM panel isolation
  • Result caching (chrome.storage.local, 24h TTL)
  • Quick links to Snopes, Wikipedia, Google Scholar
  • ARIA accessibility labels

ClaimReview Schema Export

Generates Schema.org ClaimReview JSON-LD markup that publishers can embed in their pages. Embedded claims become eligible to appear in Google search results as fact-checks.
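A minimal ClaimReview document looks like the sketch below. This shows only the core Schema.org fields; the actual export includes more (publisher, author, dates), and the function name is an assumption:

```python
import json

def claim_review_jsonld(claim_text: str, url: str, rating_name: str) -> str:
    # Minimal Schema.org ClaimReview markup as JSON-LD; publishers embed
    # this in a <script type="application/ld+json"> tag.
    doc = {
        "@context": "https://schema.org",
        "@type": "ClaimReview",
        "claimReviewed": claim_text,
        "url": url,
        "reviewRating": {
            "@type": "Rating",
            "alternateName": rating_name,  # e.g. "Supported", "Unknown"
        },
    }
    return json.dumps(doc, indent=2)
```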


Tech Stack

Component Technology
Language Python 3.12+
Transcription faster-whisper (CTranslate2, CUDA-accelerated)
Audio download yt-dlp
Database SQLite (single-file, zero-config)
API server FastAPI + SSE
Web frontend React + Vite + Tailwind CSS
CLI Click + Rich
NLP Rule-based + optional spaCy (en_core_web_sm) for enhanced sentence splitting, NER, subject detection
Text extraction trafilatura (HTML article extraction)
Scoring BM25Okapi + rule-based (0-100 scale)
HTTP client requests (29 free sources)
Caching SQLite-backed API response cache (per-source TTL)
GPU support NVIDIA CUDA 12 via pip (nvidia-cublas-cu12, nvidia-cudnn-cu12)
Testing pytest (641 tests)

Quick Start

# Clone and set up
git clone https://github.com/Obelus-Labs-LLC/Veritas.git
cd Veritas
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # Linux/Mac

# Install
pip install -e .

# (Optional) Install CUDA-enabled PyTorch for GPU transcription
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# (Optional) Install spaCy for enhanced NLP (better sentence splitting, NER, subject detection)
pip install "veritas[nlp]"
python -m spacy download en_core_web_sm

# Run the pipeline on a YouTube video (uses fast caption extraction)
python -m veritas ingest "https://www.youtube.com/watch?v=EXAMPLE"
python -m veritas claims <source_id>
python -m veritas assist <source_id>

# Or ingest text/PDF directly (no audio, no transcription)
python -m veritas ingest-text path/to/document.pdf
python -m veritas claims <source_id>
python -m veritas assist <source_id>

# Or ingest a web article (auto-extracts article text + metadata)
python -m veritas ingest-url "https://example.com/article"
python -m veritas claims <source_id>
python -m veritas assist <source_id>

# Build the knowledge graph
python -m veritas build-graph

# Check results
python -m veritas sources
python -m veritas queue
python -m veritas inspect-verified --status supported --verbose
python -m veritas clusters
python -m veritas export <source_id> --format md

# Run the web frontend
uvicorn veritas.api:app --port 8000  # API server
cd web && npm run dev                 # React frontend on :5173

CLI Commands

Command Description
python -m veritas ingest <url> Download audio, save metadata, register source
python -m veritas ingest-text <path> Ingest a text or PDF file directly (no audio)
python -m veritas ingest-url <url> Ingest a web article URL (auto-detects YouTube/PDF/HTML)
python -m veritas transcribe <id> Transcribe audio with faster-whisper (GPU)
python -m veritas claims <id> Extract candidate claims (deterministic, rule-based)
python -m veritas assist <id> Auto-discover evidence from 29 free APIs
python -m veritas build-graph Build the knowledge graph (fingerprint, cluster, consensus)
python -m veritas clusters Show top claim clusters from the knowledge graph
python -m veritas cluster <id> Show detailed view of a single cluster and its members
python -m veritas queue Show claims needing review, sorted by priority
python -m veritas review <id> Interactively review and verify claims
python -m veritas verify <claim_id> Set status and attach evidence for a single claim
python -m veritas inspect-verified Inspect auto-verified claims with evidence and signals
python -m veritas export <id> Generate Markdown or JSON brief with provenance labels
python -m veritas search "<query>" Full-text search across all claims
python -m veritas sources List all ingested sources with verification metrics
python -m veritas spread <claim_hash> Show where a claim appears across sources
python -m veritas timeline <claim_hash> Chronological propagation of a claim
python -m veritas top-claims Most-repeated claims across all sources
python -m veritas doctor Check environment, GPU, and dependency status

How Claim Extraction Works

Veritas uses deterministic rules — no AI, no API calls — to identify checkable statements:

  1. Segment Stitching — Whisper outputs segments with arbitrary boundaries. Veritas merges adjacent segments into windows so complete sentences can be recovered across boundaries.

  2. Sentence Splitting — The stitched window is split at punctuation boundaries. Fragments shorter than 7 words or 40 characters are rejected. Claims are capped at 240 characters.

  3. Candidate Detection — A sentence becomes a candidate claim if it contains a signal (numbers, dates, named entities, or assertion verbs) AND has a subject-like anchor (proper noun, pronoun, or number).

  4. Fragment Filtering — Dangling clauses starting with conjunctions are rejected. YouTube and podcast boilerplate is filtered out (sponsor messages, review requests, self-references).

  5. Classification — Each claim gets confidence language (hedged/definitive/unknown), a category (12 categories), and a pipe-delimited signal log showing exactly which rules fired.

  6. Deduplication — Two-layer dedup: SHA256 hash for exact matches (local and global), plus SequenceMatcher (0.78 threshold) for near-duplicates.
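The two-layer dedup in step 6 can be sketched directly from the description — SHA256 over a normalized form for exact repeats, then SequenceMatcher at the stated 0.78 threshold for near-duplicates. Function names are illustrative:

```python
import hashlib
from difflib import SequenceMatcher

def exact_hash(text: str) -> str:
    # Layer 1: SHA256 over a whitespace/case-normalized form catches
    # exact repeats regardless of trivial formatting differences.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedup_claims(claims: list[str], fuzzy_threshold: float = 0.78) -> list[str]:
    kept: list[str] = []
    seen_hashes: set[str] = set()
    for claim in claims:
        h = exact_hash(claim)
        if h in seen_hashes:
            continue
        # Layer 2: SequenceMatcher catches near-duplicates at/above threshold.
        if any(SequenceMatcher(None, claim.lower(), k.lower()).ratio() >= fuzzy_threshold
               for k in kept):
            continue
        seen_hashes.add(h)
        kept.append(claim)
    return kept
```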


Claim Categories

Category Routes To
finance yfinance, SEC EDGAR, SEC Gov, FRED, Treasury, BLS, CBO, USASpending, OpenCorporates, Google Fact Check, Crossref, Wikipedia
health WHO GHO, PubMed, OpenFDA, Google Fact Check, Crossref, Semantic Scholar, Wikipedia
science arXiv, Semantic Scholar, Crossref, PubMed, World Bank, Wikipedia
tech arXiv, Crossref, PatentsView, OpenCorporates, Google Fact Check, Wikipedia
politics Congress.gov, Google Fact Check, FEC, GovInfo, OpenStates, OpenSanctions, Treasury, SEC Gov, CBO, USASpending, Wikipedia
military Google Fact Check, Congress.gov, USASpending, GovInfo, OpenSanctions, Crossref, Wikipedia
education Census, World Bank, Crossref, Google Fact Check, Semantic Scholar, Wikipedia
energy_climate World Bank, Crossref, arXiv, Google Fact Check, Wikipedia
labor BLS, FRED, Census, Google Fact Check, Crossref, Wikipedia
history_culture Wikipedia, Wikidata, Crossref, Semantic Scholar, DuckDuckGo, Google Fact Check
legal CourtListener, Google Fact Check, Congress.gov, GovInfo, Wikipedia, CBO, Crossref, SEC Gov
general All 29 sources in fixed order

Integration: WeThePeople

Veritas is being developed alongside WeThePeople, a civic transparency platform. They are separate projects today with a planned integration path:

  • WeThePeople collects and organizes public political content — congressional hearings, campaign speeches, policy debates
  • Veritas provides the verification layer — extracting claims and checking them against primary sources
  • The integration path: WeThePeople sends politician hearing clips and speech transcripts to Veritas for automated claim extraction and evidence verification, then surfaces the results to citizens

Together, they form a pipeline from raw political speech to verified, evidence-linked claims — with full transparency at every step.


Running Tests

pip install -e ".[dev]"
python -m pytest tests/ -q --tb=short

641 tests. Tests use fixture transcripts and mocked APIs — no network calls or GPU required.


Architecture

veritas-app/
├── src/veritas/
│   ├── cli.py              # Click CLI (20 commands)
│   ├── api.py              # FastAPI service layer (REST + SSE)
│   ├── ingest.py           # Audio download (yt-dlp)
│   ├── ingest_text.py      # Text/PDF/URL ingestion (trafilatura, URL-type routing)
│   ├── ingest_captions.py  # Fast YouTube caption extraction (no GPU)
│   ├── transcript_parser.py # SRT/VTT subtitle parsing
│   ├── rss_ingest.py       # RSS/Atom feed ingestion
│   ├── transcribe.py       # Speech-to-text (faster-whisper)
│   ├── claim_extract.py    # Deterministic claim extraction (12 categories)
│   ├── assist.py           # Smart routing + evidence orchestration (15 signals)
│   ├── scoring.py          # BM25Okapi scoring + consensus (0-100)
│   ├── query_variants.py   # Zero-LLM query variant generation
│   ├── url_normalize.py    # URL normalization + evidence deduplication
│   ├── knowledge_graph.py  # NetworkX Louvain clustering, PageRank, D3 export
│   ├── verdict.py          # Template-based verdict summaries for journalists
│   ├── claimreview.py      # Schema.org ClaimReview JSON-LD export
│   ├── job_queue.py        # SQLite job queue (pool routing, graceful shutdown)
│   ├── evidence_sources/   # 29 free API integrations
│   │   ├── base.py         # Shared HTTP: rate limiting, caching, backoff
│   │   ├── crossref.py     ├── arxiv.py
│   │   ├── pubmed.py       ├── semantic_scholar.py
│   │   ├── sec_edgar.py    ├── sec_gov.py
│   │   ├── yfinance_source.py ├── fred_source.py
│   │   ├── treasury.py     ├── wikipedia_source.py
│   │   ├── wikidata.py     ├── duckduckgo.py
│   │   ├── google_factcheck.py ├── openfda.py
│   │   ├── bls.py          ├── cbo.py
│   │   ├── usaspending.py  ├── census.py
│   │   ├── worldbank.py    ├── patentsview.py
│   │   ├── congress.py     ├── govinfo.py
│   │   ├── fec.py          ├── openstates.py
│   │   ├── who_gho.py      ├── courtlistener.py
│   │   ├── opensanctions.py ├── opencorporates.py
│   │   └── local_datasets.py
│   ├── verify.py           # Human claim verification
│   ├── db.py               # SQLite schema + migrations
│   ├── models.py           # Data models
│   ├── config.py           # Constants and paths
│   ├── paths.py            # Directory path helpers
│   ├── export.py           # Markdown/JSON brief generation
│   ├── search.py           # Full-text claim search
│   └── doctor.py           # Environment health checks
├── web/                    # React + Vite + Tailwind frontend
│   ├── src/pages/          # 3 screens (Home, Results, Vault)
│   ├── src/lib/            # API client + design tokens
│   └── src/components/     # Shared UI components
├── chrome-extension/       # MV3 browser extension
│   ├── manifest.json       # Keyboard shortcut, context menu
│   ├── background.js       # Fetch + caching + backoff
│   ├── content.js          # Shadow DOM panel
│   └── popup.html/js       # Dual-tab UI (Search + Settings)
├── wordpress-plugin/       # WordPress integration skeleton
├── tests/                  # pytest suite (641 tests)
├── scripts/                # Batch operations
│   └── batch_assist.py     # Batch evidence discovery
├── data/                   # Local data (gitignored)
│   ├── raw/                # Downloaded audio
│   ├── transcripts/        # Whisper output
│   ├── datasets/           # Curated CSV datasets
│   ├── cache/              # API response cache (SQLite)
│   ├── exports/            # Generated briefs
│   └── veritas.sqlite      # Claim database
└── pyproject.toml

Design Principles

  • No external LLM — all extraction and scoring is deterministic
  • No paid APIs — runs entirely on local compute + free public APIs
  • Privacy first — nothing leaves your machine except structured API queries to public endpoints
  • Unknown is honest — Veritas never fabricates confidence. No evidence means UNKNOWN
  • Explainability — every claim logs which rules fired; every evidence suggestion logs its scoring breakdown
  • AUTO vs HUMAN — exports clearly separate machine suggestions from human verification
  • Temporal awareness — claims with dates are scored against time-relevant evidence
  • URL normalization — evidence URLs are normalized before storage to prevent duplicates from tracking parameters or host aliases
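The URL normalization principle can be sketched with the standard library. The tracking-parameter list here is a small illustrative subset, and the exact normalization rules (e.g. trailing-slash handling) are assumptions:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Hypothetical tracking parameters to strip; the real list would be longer.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    # Lowercase the host, drop tracking params and fragments, and strip
    # trailing slashes so equivalent evidence URLs collapse to one key.
    parts = urlparse(url)
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING_PARAMS])
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/"),
        "",      # params
        query,
        "",      # fragment
    ))
```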

License

MIT


Built and maintained by Obelus Labs LLC.

Veritas — Latin for "truth."


If this project was useful to you, consider giving it a star — it helps others discover it.
