The data ingestion pipeline should serve as the primary interface for raw documents, converting arbitrary file, format, and schema types into canonical, yet extensible, structured data. As a result, the pipeline's derivative schema should universally attest to the source content and ideally take a form useful to every other component module.
Current model *key fields
• arkham_frame.documents
○ Filename
○ File size
○ Page count
○ Metadata (JSONB)
• arkham_frame.pages
○ Text
○ Image path
○ Page metadata (JSONB)
○ FK(document)
• arkham_frame.chunks
○ Text
○ FK(document)
○ FK(vector)
• arkham_frame.entities
○ Text
○ Type
○ Source key
○ FK(canonical entity)
• arkham_frame.canonical_entities
○ Name
○ Type
○ Aliases
• arkham_entity_mentions
○ Text
○ FK(entities)
○ FK(document)
• arkham_claims
○ Text
○ Claim type (factual, opinion, prediction, quantitative, attribution)
○ Verification status
○ FK(document)
• arkham_claim_evidence
○ FK(claim)
○ Text
○ Type (document, entity, external, claim)
○ Notes
Commentary
The current model inefficiently captures analytically relevant content through functionally duplicate fields and weak relationship mapping.
Proposal (Artifact Model)
High Level
Artifacts:
1. Chunk - Semantically portioned, structured, raw components of sources. Used for lexical search and textual reference (concordance).
2. Entities - The subject, actor, object, or location of claims, including references to sources.
3. Claims - Assertions (predicate) of qualities associated with entities or claims (subject/object) according to a chunk (and therefore a source). Optionally made by an entity (actor).
Support:
1. Documents - Functionally, metadata; generated or processed information adjacent to sources. Used for filtering and monitoring.
2. Source - Stored as an object; used as a reference and for visual grounding and review.
Model *key fields
(Documents: Custody)
- Filename
- Title
- Object (pointer to original object)
- Ingest source
- Acquisition method
- Custody chain
- Ingestion version
- Version
(Chunk: Evidence)
- Content (verbatim text)
- Annotations (tags, structure, text features, form elements, etc.)
- Context
- Summary (non-authoritative chunk description)
- Extraction Confidence (accuracy of content representation)
- Format
- Extraction pipeline version
- Derivation (pointer to original object)
- Index
- Parent (pointer to parent chunk)
- Children (pointer to children chunks)
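The Chunk field list above can be sketched as a small Python dataclass. This is purely illustrative: the field names mirror the list but the types, defaults, and `add_child` helper are assumptions, not a committed schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Chunk:
    content: str                        # verbatim text, never paraphrased
    index: int                          # position within the parent sequence
    derivation: str                     # pointer to the original object
    format: str = "text/plain"
    extraction_confidence: float = 1.0  # accuracy of content representation
    parent: Optional[Chunk] = None
    children: list[Chunk] = field(default_factory=list)

    def add_child(self, child: Chunk) -> None:
        # Maintain both directions of the small-to-big hierarchy.
        child.parent = self
        self.children.append(child)
```

The parent/children pointers are what make small-to-big chunking navigable in both directions, from a leaf back to its enclosing section and down again.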
(Entity: Identity)
- Type (Document, Video, Person, Organization, Location, Account, Software, etc.)
- Stable identifier
- Label
- Aliases
- Disambiguation
- Count (across corpus)
(Claim: Meaning)
- Subject (Entity or Claim)
- Predicate
- Object (Entity or Claim)
- Value
- Value type
- Occurrence time
- Occurrence time precision
- Duration
- Duration precision
- Claim time
- Claim time precision
- Location (Entity)
- Actor (Entity)
- Polarity (affirmed / denied)
- Sources (Chunk)
- Attribution indexes (references to location in chunks where field value is found)
- Confidence (likelihood of truthiness)
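The defining feature of the Claim fields above is that subject and object may each be an Entity or another Claim, which lets claims nest (claims about claims). A hedged sketch, with hypothetical types and defaults:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union, Optional

@dataclass
class Entity:
    label: str
    type: str

@dataclass
class Claim:
    subject: Union[Entity, Claim]            # claims can be about claims
    predicate: str
    object: Optional[Union[Entity, Claim]] = None
    polarity: str = "affirmed"               # affirmed / denied
    sources: list[str] = field(default_factory=list)  # chunk ids
    confidence: float = 0.5                  # likelihood of truthiness

# A base claim grounded in a chunk, then a claim about that claim.
acme = Entity(label="Acme Corp", type="Organization")
base = Claim(subject=acme, predicate="acquired",
             object=Entity("Widget Co", "Organization"),
             sources=["chunk-42"])
meta = Claim(subject=base, predicate="disputed_by",
             object=Entity("Analyst X", "Person"))
```

Note that `base` carries its chunk reference while `meta` asserts something about `base` itself, keeping "chunks never assert truth" and "claims never attest content" cleanly separated.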
Notes
Chunks never assert truth.
Claims never attest content.
Entities never imply facts.
Generally, we never update in place and instead append new artifacts. The exception may be disambiguation.
Conceptually, the documents table will be used for raw object tracking and reference, but little more. Chunks serve the purpose of "mentions" but do a bit more as well. Entire sources will be deconstructed into small-to-big chunks so we aren't storing raw text redundantly; full document text is easy to reconstruct with this model, as are semantically relevant portions. That said, Claims, Entities, and Chunks are all first-class citizens in this model and will all later have embedded equivalents for semantic RAG, in addition to the lexical search possible given this table field model.
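Reconstructing full document text from the chunk tree can be sketched like this. It assumes flat chunk rows carrying (id, parent_id, index, content) and that verbatim text lives in the leaves while interior chunks act as structure, which is one plausible reading of the model, not a committed design.

```python
def reconstruct(chunks: list[dict], root_id: int) -> str:
    """Concatenate leaf content in index order beneath `root_id`."""
    by_parent: dict[int | None, list[dict]] = {}
    for c in chunks:
        by_parent.setdefault(c["parent_id"], []).append(c)

    def walk(node_id: int) -> str:
        kids = sorted(by_parent.get(node_id, []), key=lambda c: c["index"])
        if not kids:
            # Leaf chunk: emit its verbatim content.
            node = next(c for c in chunks if c["id"] == node_id)
            return node["content"]
        return "".join(walk(k["id"]) for k in kids)

    return walk(root_id)
```

Passing an interior chunk's id instead of the root yields just that semantically relevant portion, which is the other reconstruction case the model needs.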
Overview
This knowledge acquisition system robustly compiles and relates knowledge held within source documents in a structured and scalable format. It is capable of:
• Handling ambiguity and hostile data
• Surviving reprocessing
• Supporting adversarial review
• Tightly associating inter- and intra-concept relationships
• Minimizing data expansion (~5x -> ~2.5x, a 50% reduction)
• Improving information density (difficult to calculate; back-of-the-napkin ~3x)
• Aligning well with analysis
• Providing a clear interface for database optimizations
• Supporting audits