94 changes: 94 additions & 0 deletions docs/explanation/document-roles.md
@@ -0,0 +1,94 @@
# Document Structure

Documents and pages are composed of logical parts, e.g. headings, footnotes, tables, images, headers, and footers, that describe *what content is*, rather than *how it is visually presented*. These logical parts are independent of layout, styling, or rendering medium.

We refer to this logical classification as **document structure**, or more generally as **document semantics**. In this context, *semantics* describe the role a piece of content plays within the document, as defined by the [W3C Web Accessibility Initiative definition of semantics](https://www.w3.org/TR/wai-aria/#dfn-semantics).

Following the terminology of the [W3C Web Accessibility Initiative (WAI)](https://www.w3.org/WAI/standards-guidelines/aria/), each logical part of a document is assigned a **role**.
Roles prefixed with `doc-` identify **document structure roles**, while other roles describe **in-page or interactive structures**.

Document structure roles describe elements that organize and contextualize content within a page or publication. These roles are typically **non-interactive** and convey structural meaning rather than behavior.

**Roles** are borrowed from [Accessible Rich Internet Applications (WAI-ARIA)](https://www.w3.org/TR/wai-aria/) and the [Digital Publishing WAI-ARIA Module](https://www.w3.org/TR/dpub-aria-1.1/); the latter is specifically suited to the structural divisions of long-form documents.

Because many document parsers and extraction tools use inconsistent or undocumented classification schemes, we maintain explicit **mappings from parser-specific categories to WAI-ARIA roles**. For this reason, the `role` field (rather than `category`) represents document structure semantics in a standardized way, while `category` keeps the label assigned by the parser. Categories without a mapping fall back to the `generic` role. Note that a parser might not support all roles.
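
As a minimal sketch of such a mapping, assume a hypothetical parser that emits the categories `body`, `formula`, and `page-header`. The names `EXAMPLE_PARSER_TO_ROLE` and `to_role` are illustrative only and are not part of the library:

```python
from typing import Optional

# Hypothetical mapping for an imaginary parser (illustrative names only).
EXAMPLE_PARSER_TO_ROLE: dict[str, str] = {
    'body': 'paragraph',
    'formula': 'math',
    'page-header': 'doc-pageheader',
}


def to_role(category: Optional[str]) -> str:
    """Return the standardized role for a parser-specific category.

    Missing or unmapped categories fall back to 'generic'.
    """
    if not category:
        return 'generic'
    return EXAMPLE_PARSER_TO_ROLE.get(category, 'generic')


print(to_role('formula'))  # mapped category
print(to_role('sidebar'))  # unmapped category, falls back to 'generic'
print(to_role(None))       # parser gave no category, falls back to 'generic'
```

Keeping the parser's original label in `category` alongside the mapped `role` means the original classification is not lost when a category has no precise equivalent.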


## Page level roles

Page-level roles apply to parts of a page (although some of them can span multiple pages). They are a subset of the [WAI-ARIA Document Structure Roles](https://www.w3.org/TR/wai-aria/#document_structure_roles)[^1].

[^1]: The term *document structure* can be confusing here: ARIA roles come from accessibility in HTML documents, where *document* refers to an element of the markup language containing content that assistive technology users may want to browse in a reading mode.

- **blockquote**: A section of content quoted from another source.
- **caption**: Visible content that names and may describe a figure or a table.
- **definition**: Content that provides the meaning or explanation of a term.
- **deletion**: Content that has been removed or marked as deleted from the document.
- **emphasis**: Text that is emphasized to convey stress or importance in spoken or written language.
- **figure**: A self-contained unit of content, such as an image, diagram, code example, or illustration, often referenced from the main text.
- **generic**: Content that does not convey a specific semantic role beyond grouping or containment.
- **heading**: A label or title that introduces and describes the topic of a section.
- **insertion**: Content that has been added or inserted into the document.
- **list**: A collection of related items presented in a sequential or grouped form.
- **listitem**: A single item within a list.
- **math**: Mathematical expressions or formulas represented in a structured form.
- **paragraph**: A distinct block of text that presents a single idea or unit of discourse.
- **row**: A horizontal grouping of cells within a table or grid.
- **strong**: Text of strong importance, seriousness, or urgency.
- **subscript**: Text displayed below the baseline, typically used in mathematical or chemical notation.
- **superscript**: Text displayed above the baseline, commonly used for exponents, references, or annotations.
- **table**: A structured arrangement of data organized into rows and columns.





## Document level roles

Document-level roles apply to parts of a document. They are the same as those defined in the [Digital Publishing WAI-ARIA Module](https://www.w3.org/TR/dpub-aria-1.1/).


- **doc-abstract**: A short summary of the principal ideas, concepts, and conclusions of the work, or of a section or excerpt within it.
- **doc-acknowledgments**: A section or statement that acknowledges significant contributions by persons, organizations, governments, and other entities to the realization of the work.
- **doc-afterword**: A closing statement from the author or a person of importance, typically providing insight into how the content came to be written, its significance, or related events that have transpired since its timeline.
- **doc-appendix**: A section of supplemental information located after the primary content that informs the content but is not central to it.
- **doc-backlink**: A link that allows the user to return to a related location in the content (e.g., from a footnote to its reference or from a glossary definition to where a term is used).
- **doc-bibliography**: A list of external references cited in the work, which could be to print or digital sources.
- **doc-biblioref**: A reference to a bibliography entry.
- **doc-chapter**: A major thematic section of content in a work.
- **doc-colophon**: A short section of production notes particular to the edition (e.g., describing the typeface used), often located at the end of a work.
- **doc-conclusion**: A concluding section or statement that summarizes the work or wraps up the narrative.
- **doc-cover**: An image that sets the mood or tone for the work and typically includes the title and author.
- **doc-credit**: An acknowledgment of the source of integrated content from third-party sources, such as photos. Typically identifies the creator, copyright, and any restrictions on reuse.
- **doc-credits**: A collection of credits.
- **doc-dedication**: An inscription at the front of the work, typically addressed in tribute to one or more persons close to the author.
- **doc-endnotes**: A collection of notes at the end of a work or a section within it.
- **doc-epigraph**: A quotation set at the start of the work or a section that establishes the theme or sets the mood.
- **doc-epilogue**: A concluding section of narrative that wraps up or comments on the actions and events of the work, typically from a future perspective.
- **doc-errata**: A set of corrections discovered after initial publication of the work, sometimes referred to as corrigenda.
- **doc-example**: An illustration of a key concept of the work, such as a code listing, case study or problem.
- **doc-footnote**: Ancillary information, such as a citation or commentary, that provides additional context to a referenced passage of text.
- **doc-foreword**: An introductory section that precedes the work, typically not written by the author of the work.
- **doc-glossary**: A brief dictionary of new, uncommon, or specialized terms used in the content.
- **doc-glossref**: A reference to a glossary definition.
- **doc-index**: A navigational aid that provides a detailed list of links to key subjects, names and other important topics covered in the work.
- **doc-introduction**: A preliminary section that typically introduces the scope or nature of the work.
- **doc-noteref**: A reference to a footnote or endnote, typically appearing as a superscripted number or symbol in the main body of text.
- **doc-notice**: Notifies the user of consequences that might arise from an action or event. Examples include warnings, cautions and dangers.
- **doc-pagebreak**: A separator denoting the position before which a break occurs between two contiguous pages in a statically paginated version of the content.
- **doc-pagefooter**: A section of text appearing at the bottom of a page that provides context about the current work and location within it. The page footer is distinct from the body text and normally follows a repeating template that contains (possibly truncated) items such as the document title, current section, author name(s), and page number.
- **doc-pageheader**: A section of text appearing at the top of a page that provides context about the current work and location within it. The page header is distinct from the body text and normally follows a repeating template that contains (possibly truncated) items such as the document title, current section, author name(s), and page number.
- **doc-pagelist**: A navigational aid that provides a list of links to the page breaks in the content.
- **doc-part**: A major structural division in a work that contains a set of related sections dealing with a particular subject, narrative arc, or similar encapsulated theme.
- **doc-preface**: An introductory section that precedes the work, typically written by the author of the work.
- **doc-prologue**: An introductory section that sets the background to a work, typically part of the narrative.
- **doc-pullquote**: A distinctively placed or highlighted quotation from the current content designed to draw attention to a topic or highlight a key point.
- **doc-qna**: A section of content structured as a series of questions and answers, such as an interview or list of frequently asked questions.
- **doc-subtitle**: An explanatory or alternate title for the work, or a section or component within it.
- **doc-tip**: Helpful information that clarifies some aspect of the content or assists in its comprehension.
- **doc-toc**: A navigational aid that provides an ordered list of links to the major sectional headings in the content. A table of contents could cover an entire work or only a smaller section of it.


Because in ARIA the title comes from the HTML document rather than from a role, we added a non-standard role, `doc-title`, to represent the document title.
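
The `doc-` prefix convention and the non-standard addition can be sketched as two small checks. The helper names below are hypothetical and only illustrate the naming rule described in this page:

```python
# Roles added by this project on top of the DPUB-ARIA vocabulary.
NON_STANDARD_ROLES = frozenset({'doc-title'})


def is_document_level(role: str) -> bool:
    """Document-level roles are the ones prefixed with 'doc-'."""
    return role.startswith('doc-')


def is_project_extension(role: str) -> bool:
    """True for roles that are not defined by WAI-ARIA or DPUB-ARIA."""
    return role in NON_STANDARD_ROLES


for role in ('heading', 'doc-toc', 'doc-title'):
    print(role, is_document_level(role), is_project_extension(role))
```

A consumer that needs strict DPUB-ARIA output could use a check like `is_project_extension` to decide how to treat `doc-title`.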


41 changes: 39 additions & 2 deletions src/parxy_core/drivers/landingai.py
@@ -18,6 +18,41 @@
 from parxy_core.models import Document, Metadata, TextBlock, Page, BoundingBox
 from parxy_core.utils import safe_json_dumps

+# Mapping from LandingAI ADE chunk types to WAI-ARIA document structure roles
+# See docs/explanation/document-roles.md for role definitions
+LANDINGAI_TO_ROLE: dict[str, str] = {
+    'text': 'paragraph',
+    'table': 'table',
+    'marginalia': 'generic',  # Mixed content in margins - too generic to map precisely
+    'figure': 'figure',
+    'logo': 'figure',  # DPT-2 only: logos are visual elements
+    'card': 'figure',  # DPT-2 only: ID cards, driver licenses
+    'attestation': 'figure',  # DPT-2 only: signatures, stamps, seals
+    'scan_code': 'figure',  # DPT-2 only: QR codes, barcodes
+    # Footer variants
+    'page-footer': 'doc-pagefooter',
+    'page_footer': 'doc-pagefooter',
+    'footer': 'doc-pagefooter',
+    'page-number': 'doc-pagefooter',
+    # Footnote variants
+    'footnote': 'doc-footnote',
+    'note': 'doc-footnote',
+    'endnote': 'doc-endnotes',
+    'annotation': 'doc-footnote',
+    'footer-note': 'doc-footnote',
+    # Heading variants
+    'heading': 'heading',
+    'title': 'doc-title',
+    'subtitle': 'doc-subtitle',
+    'section': 'heading',
+    'chapter': 'doc-chapter',
+    # Header variants
+    'page-header': 'doc-pageheader',
+    'page_header': 'doc-pageheader',
+    'page-heading': 'doc-pageheader',
+    'header': 'doc-pageheader',
+}


 class LandingAIADEDriver(Driver):
     def _initialize_driver(self):
@@ -134,7 +169,8 @@ def landingaiade_to_parxy(parsed_data: ParseResponse) -> Document:
         chunk_text = chunk.markdown

         page_text_parts.append(chunk_text)
-        chunk_type = chunk.type
+        category = chunk.type
+        role = LANDINGAI_TO_ROLE.get(category, 'generic') if category else 'generic'

         # Get bounding box from first grounding
         bbox = None
@@ -147,9 +183,10 @@ def landingaiade_to_parxy(parsed_data: ParseResponse) -> Document:
         # Create the appropriate block type
         block = TextBlock(
             type='text',
+            role=role,
             bbox=bbox,
             page=page,
-            category=chunk_type,
+            category=category,
             text=chunk_text,
             source_data=chunk.model_dump(),
         )
40 changes: 39 additions & 1 deletion src/parxy_core/drivers/llamaparse.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,41 @@
FileNotFoundException,
)

# Mapping from LlamaParse types to WAI-ARIA document structure roles
# See docs/explanation/document-roles.md for role definitions
LLAMAPARSE_TO_ROLE: dict[str, str] = {
'text': 'paragraph',
'table': 'table',
'tables': 'table',
'figure': 'figure',
'figures': 'figure',
'list': 'list',
'lists': 'list',
# Footer variants
'footer': 'doc-pagefooter',
'page-footer': 'doc-pagefooter',
'page_footer': 'doc-pagefooter',
'page-number': 'doc-pagefooter',
# Footnote variants
'footnote': 'doc-footnote',
'note': 'doc-footnote',
'endnote': 'doc-endnotes',
'annotation': 'doc-footnote',
'footer-note': 'doc-footnote',
# Heading variants
'heading': 'heading',
'title': 'doc-title',
'titles': 'heading',
'subtitle': 'doc-subtitle',
'section': 'heading',
'chapter': 'doc-chapter',
# Header variants
'page-header': 'doc-pageheader',
'page_header': 'doc-pageheader',
'page-heading': 'doc-pageheader',
'header': 'doc-pageheader',
}

 _credits_per_parsing_mode = {
     # Minimum credits per parsing mode as deduced from https://developers.llamaindex.ai/python/cloud/general/pricing/
     'accurate': 3,  # equivalent to Parse page with LLM as observed in their dashboard
@@ -453,9 +488,12 @@ def _convert_text_block(text_block: PageItem, page_number: int) -> TextBlock:
     text_value = text_block.value if text_block.value else ''
     if text_value == 'NO_CONTENT_HERE':
         text_value = ''
+    category = text_block.type
+    role = LLAMAPARSE_TO_ROLE.get(category, 'generic') if category else 'generic'
     return TextBlock(
         type='text',
-        category=text_block.type,
+        role=role,
+        category=category,
         level=text_block.lvl,
         text=text_value,
         bbox=bbox,
46 changes: 46 additions & 0 deletions src/parxy_core/drivers/pdfact.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,50 @@
from parxy_core.models.config import PdfActConfig
from parxy_core.tracing.utils import trace_with_output

# Mapping from PdfAct categories to WAI-ARIA document structure roles
# See docs/explanation/document-roles.md for role definitions
PDFACT_TO_ROLE: dict[str, str] = {
'abstract': 'doc-abstract',
'acknowledgments': 'doc-acknowledgments',
'affiliation': 'generic',
'appendix': 'doc-appendix',
'authors': 'generic',
'body': 'paragraph',
'caption': 'caption',
'categories': 'generic',
'figure': 'figure',
'formula': 'math',
'general-terms': 'generic',
'itemize-item': 'listitem',
'keywords': 'generic',
'other': 'generic',
'reference': 'doc-biblioref',
'table': 'table',
'toc': 'doc-toc',
# Footer variants
'footer': 'doc-pagefooter',
'page-footer': 'doc-pagefooter',
'page_footer': 'doc-pagefooter',
'page-number': 'doc-pagefooter',
# Footnote variants
'footnote': 'doc-footnote',
'note': 'doc-footnote',
'endnote': 'doc-endnotes',
'annotation': 'doc-footnote',
'footer-note': 'doc-footnote',
# Heading variants
'heading': 'heading',
'title': 'doc-title',
'subtitle': 'doc-subtitle',
'section': 'heading',
'chapter': 'doc-chapter',
# Header variants
'page-header': 'doc-pageheader',
'page_header': 'doc-pageheader',
'page-heading': 'doc-pageheader',
'header': 'doc-pageheader',
}


 class PdfActDriver(Driver):
     """PdfAct service driver.
@@ -287,6 +331,7 @@ def _convert_text_block(
     data = text_block.get('paragraph')
     text = data.get('text', '')
     category = data.get('role') if 'role' in data else None
+    role = PDFACT_TO_ROLE.get(category, 'generic') if category else 'generic'
     positions = data.get('positions', [])

     # Convert font and color
@@ -308,6 +353,7 @@ def _convert_text_block(
     bbox = _convert_bbox([pos for pos in positions if pos['page'] == page])
     return TextBlock(
         type='text',
+        role=role,
         bbox=bbox,
         style=style,
         page=page,
11 changes: 5 additions & 6 deletions src/parxy_core/models/models.py
@@ -60,17 +60,16 @@ def isEmpty(self) -> bool:

 class Block(BaseModel, ABC):
     type: str
+    role: Optional[str] = 'generic'
+    """Document Structure role recognized for this block"""
     bbox: Optional[BoundingBox] = None
     page: Optional[int] = None
     source_data: Optional[dict[str, Any]] = None
+    category: Optional[str] = None
+    """Category attributed to this block by the parser"""


-class TextBlock(BaseModel):
-    type: str
-    bbox: Optional[BoundingBox] = None
-    page: Optional[int] = None
-    source_data: Optional[dict[str, Any]] = None
-    category: Optional[str] = None
+class TextBlock(Block):
     style: Optional[Style] = None
     level: Optional[int] = None
     lines: Optional[List[Line]] = None
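
The effect of the `role` default on consumers can be sketched without Pydantic. The dataclass below is a simplified stand-in for `TextBlock` (an assumption made to keep the example dependency-free; the real model inherits from `Block` and has more fields), and `TextBlockSketch` is a hypothetical name:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextBlockSketch:
    """Simplified stand-in for TextBlock; not the real Pydantic model."""

    type: str
    text: str
    role: Optional[str] = 'generic'  # standardized role, defaults to 'generic'
    category: Optional[str] = None   # label assigned by the parser


# A block created without a role keeps the 'generic' default.
unmapped = TextBlockSketch(type='text', text='Hello')

# A block whose parser category was mapped to a standardized role.
mapped = TextBlockSketch(
    type='text', text='1 Introduction', role='heading', category='section'
)

# Consumers can filter on role regardless of which parser produced the block.
headings = [b.text for b in (unmapped, mapped) if b.role == 'heading']
print(headings)
```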