Definitions:
Source: The knowledge base file from which content and metadata is extracted
Content: Data extracted from a source; generally Text or Image
Metadata: Descriptive data which can be associated with Sources, Content(Image or Text); metadata can be extracted from Source/Content, or generated using models, heuristics, etc
Field | Description | Method | |
---|---|---|---|
Content | Content | Content extracted from Source | Extracted |
Source Metadata | Source Name | Name of source | Extracted |
Source ID | ID of source | Extracted | |
Source location | URL, URI, pointer to storage location | N/A | |
Source Type | PDF, HTML, Docx, TXT, PPTx | Extracted | |
Collection ID | Collection in which the source is contained | N/A | |
Date Created | Date source was created | Extracted | |
Last Modified | Date source was last modified | Extracted | |
Summary | Summarization of Source Doc | Generated | |
Partition ID | Offset of this data fragment within a larger set of fragments | Generated | |
Access Level | Dictates RBAC | N/A | |
Content Metadata (applicable to all content types) | Type | Text, Image, Structured, Table, Chart | Generated |
Description | Text Description of the content object (Image/Table) | Generated | |
Page # | Page # where content is contained in source | Extracted | |
Hierarchy | Location/order of content within the source document | Extracted | |
Subtype | For structured data subtypes - table, chart, etc.. | ||
Text Metadata | Text Type | Header, body, etc | Extracted |
Summary | Abbreviated Summary of content | Generated | |
Keywords | Keywords, Named Entities, or other phrases | Extracted | |
Language | Generated | ||
Image Metadata | Image Type | Structured, Natural,Hybrid, etc | Generated (Classifier) |
Structured Image Type | Bar Chart, Pie Chart, etc | Generated (Classifier) | |
Caption | Any caption or subheader associated with Image | Extracted | |
Text | Extracted text from a structured chart | Extracted | |
Image location | Location (x,y) of chart within an image | Extracted | |
uploaded_image_uri | Mirrors source_metadata.source_location | ||
Table Metadata (tables within documents) | Table format | Structured (dataframe / lists of rows and columns), or serialized as markdown, html, latex, simple (cells separated just as spaces) | Extracted |
Table content | Extracted text content, formatted according to table_metadata.table_format. Important: Tables should not be chunked | Extracted | |
Table location | Bounding box of the table | Extracted | |
Caption | Detected captions for the table/chart | Extracted | |
Title | TODO | Extracted | |
Subtitle | TODO | Extracted | |
Axis | TODO | Extracted | |
uploaded_image_uri | Mirrors source_metadata.source_location | Generated |