The data ingestion pipeline should serve as the primary interface for raw documents, converting arbitrary file, format, and schema types into canonical, yet extensible, structured data. As a result, the pipeline's derivative schema should universally attest to the source content and ideally take a form useful to every other component module.
Current model *key fields
• arkham_frame.documents
○ Filename
○ File size
○ Page count
○ Metadata (JSONB)
• arkham_frame.pages
○ Text
○ Image path
○ Page metadata (JSONB)
○ FK(document)
• arkham_frame.chunks
○ Text
○ FK(document)
○ FK(vector)
• arkham_frame.entities
○ Text
○ Type
○ Source key
○ FK(canonical entity)
• arkham_frame.canonical_entities
○ Name
○ Type
○ Aliases
• arkham_entity_mentions
○ Text
○ FK(entities)
○ FK(document)
• arkham_claims
○ Text
○ Claim type (factual, opinion, prediction, quantitative, attribution)
○ Verification status
○ FK(document)
• arkham_claim_evidence
○ FK(claim)
○ Text
○ Type (document, entity, external, claim)
○ Notes
Commentary
The current model inefficiently captures analytically relevant content through functionally duplicate fields and weak relationship mapping.
Proposal (Artifact Model)
High Level
Artifacts:
1. Chunk - Semantically portioned, structured, raw components of sources. Used for lexical search and textual reference (concordance).
2. Entities - The subject, actor, object, or location of claims, including references to sources.
3. Claims - Assertions (predicate) of qualities associated with entities or claims (subject/object) according to a chunk (and therefore a source). Optionally made by an entity (actor).
Support:
1. Documents - Functionally, metadata; generated or processed information adjacent to sources. Used for filtering and monitoring.
2. Source - Stored as an object; used as a reference and for visual grounding and review.
Model *key fields
(Documents: Custody)
- Filename
- Title
- Object (pointer to original object)
- Ingest source
- Acquisition method
- Custody chain
- Ingestion version
- Version
(Chunk: Evidence)
- Content (verbatim text)
- Annotations (tags, structure, text features, form elements, etc.)
- Context
- Summary (non-authoritative chunk description)
- Extraction Confidence (accuracy of content representation)
- Format
- Extraction pipeline version
- Derivation (pointer to original object)
- Index
- Parent (pointer to parent chunk)
- Children (pointer to children chunks)
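The Chunk field list above can be sketched as a small Python dataclass. This is purely illustrative: the field names mirror the list but the types, defaults, and `add_child` helper are assumptions, not a committed schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Chunk:
    content: str                        # verbatim text, never paraphrased
    index: int                          # position within the parent sequence
    derivation: str                     # pointer to the original object
    format: str = "text/plain"
    extraction_confidence: float = 1.0  # accuracy of content representation
    parent: Optional[Chunk] = None
    children: list[Chunk] = field(default_factory=list)

    def add_child(self, child: Chunk) -> None:
        # Maintain both directions of the small-to-big hierarchy.
        child.parent = self
        self.children.append(child)
```

The parent/children pointers are what make small-to-big chunking navigable in both directions, from a leaf back to its enclosing section and down again.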
(Entity: Identity)
- Type (Document, Video, Person, Organization, Location, Account, Software, etc.)
- Stable identifier
- Label
- Aliases
- Disambiguation
- Count (across corpus)
(Claim: Meaning)
- Subject (Entity or Claim)
- Predicate
- Object (Entity or Claim)
- Value
- Value type
- Occurrence time
- Occurrence time precision
- Duration
- Duration precision
- Claim time
- Claim time precision
- Location (Entity)
- Actor (Entity)
- Polarity (affirmed / denied)
- Sources (Chunk)
- Attribution indexes (references to location in chunks where field value is found)
- Confidence (likelihood of truthiness)
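The defining feature of the Claim fields above is that subject and object may each be an Entity or another Claim, which lets claims nest (claims about claims). A hedged sketch, with hypothetical types and defaults:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union, Optional

@dataclass
class Entity:
    label: str
    type: str

@dataclass
class Claim:
    subject: Union[Entity, Claim]            # claims can be about claims
    predicate: str
    object: Optional[Union[Entity, Claim]] = None
    polarity: str = "affirmed"               # affirmed / denied
    sources: list[str] = field(default_factory=list)  # chunk ids
    confidence: float = 0.5                  # likelihood of truthiness

# A base claim grounded in a chunk, then a claim about that claim.
acme = Entity(label="Acme Corp", type="Organization")
base = Claim(subject=acme, predicate="acquired",
             object=Entity("Widget Co", "Organization"),
             sources=["chunk-42"])
meta = Claim(subject=base, predicate="disputed_by",
             object=Entity("Analyst X", "Person"))
```

Note that `base` carries its chunk reference while `meta` asserts something about `base` itself, keeping "chunks never assert truth" and "claims never attest content" cleanly separated.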
Notes
Chunks never assert truth.
Claims never attest content.
Entities never imply facts.
Generally, we never update in place and instead append new artifacts. The exception may be disambiguation.
Conceptually, the documents table will be used for raw object tracking and reference, but little more. Chunks serve the purpose of "mentions" but do a bit more as well. Entire sources will be deconstructed into small-to-big chunks so we aren't storing raw text redundantly; full document text is easy to reconstruct with this model, as are semantically relevant portions. That said, Claims, Entities, and Chunks are all first-class citizens in this model and will all later have embedded equivalents for semantic RAG, in addition to the lexical search possible given this table field model.
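Reconstructing full document text from the chunk tree can be sketched like this. It assumes flat chunk rows carrying (id, parent_id, index, content) and that verbatim text lives in the leaves while interior chunks act as structure, which is one plausible reading of the model, not a committed design.

```python
def reconstruct(chunks: list[dict], root_id: int) -> str:
    """Concatenate leaf content in index order beneath `root_id`."""
    by_parent: dict[int | None, list[dict]] = {}
    for c in chunks:
        by_parent.setdefault(c["parent_id"], []).append(c)

    def walk(node_id: int) -> str:
        kids = sorted(by_parent.get(node_id, []), key=lambda c: c["index"])
        if not kids:
            # Leaf chunk: emit its verbatim content.
            node = next(c for c in chunks if c["id"] == node_id)
            return node["content"]
        return "".join(walk(k["id"]) for k in kids)

    return walk(root_id)
```

Passing an interior chunk's id instead of the root yields just that semantically relevant portion, which is the other reconstruction case the model needs.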
Overview
This knowledge acquisition system robustly compiles and relates knowledge held within source documents in a structured and scalable format. It is capable of:
• Handling ambiguity and hostile data
• Surviving reprocessing
• Supporting adversarial review
• Tightly associating inter- and intra-concept relationships
• Minimizing data expansion (~5x -> ~2.5x, a 50% reduction)
• Improving information density (difficult to calculate; back-of-the-napkin ~3x)
• Aligning well with analysis
• Providing a clear interface for database optimizations
• Supporting audits