OneOffTech · avvertix · Jan 20, 2026 · Jan 19, 2026 · Jan 19, 2026 · Jan 19, 2026
diff --git a/docs/explanation/document-roles.md b/docs/explanation/document-roles.md
@@ -0,0 +1,94 @@
+# Document Structure
+
+Documents and pages are composed of logical parts, e.g. headings, footnotes, tables, images, headers, and footers, that describe *what content is*, rather than *how it is visually presented*. These logical parts are independent of layout, styling, or rendering medium.
+
+We refer to this logical classification as **document structure**, or more generally as **document semantics**. In this context, *semantics* describe the role a piece of content plays within the document, as defined by the [W3C Web Accessibility Initiative definition of semantics](https://www.w3.org/TR/wai-aria/#dfn-semantics).
+
+Following the terminology of the [W3C Web Accessibility Initiative (WAI)](https://www.w3.org/WAI/standards-guidelines/aria/), each logical part of a document is assigned a **role**.
+Roles prefixed with `doc-` identify **document structure roles**, while other roles describe **in-page or interactive structures**.
+
+Document structure roles describe elements that organize and contextualize content within a page or publication. These roles are typically **non-interactive** and convey structural meaning rather than behavior.
+
+**Roles** are borrowed from [Accessible Rich Internet Applications (WAI-ARIA)](https://www.w3.org/TR/wai-aria/) and the [Digital Publishing WAI-ARIA Module](https://www.w3.org/TR/dpub-aria-1.1/), which is specifically suited to the structural divisions of long-form documents.
+
+Because many document parsers and extraction tools use inconsistent or undocumented classification schemes, we maintain explicit **mappings from parser-specific categories to WAI-ARIA roles**. For this reason, we use the `role` field (rather than `category`) to represent document structure semantics in a standardized way, while `category` keeps the label assigned by the parser. Note that a parser might not support every role.
+
+
+## Page-level roles
+
+Page-level roles apply to parts of a page (although some of them can span multiple pages). Page-level roles are a subset of [WAI-ARIA Document Structure Roles](https://www.w3.org/TR/wai-aria/#document_structure_roles)[^1].
+
+[^1]: The term *document structure* can be confusing here. ARIA roles come from accessibility in an HTML document, so *document* in that context refers to an element of the markup language containing content that assistive technology users may want to browse in a reading mode.
+
+- **blockquote**: A section of content quoted from another source.
+- **caption**: Visible content that names and may describe a figure or a table.
+- **definition**: Content that provides the meaning or explanation of a term.
+- **deletion**: Content that has been removed or marked as deleted from the document.
+- **emphasis**: Text that is emphasized to convey stress or importance in spoken or written language.
+- **figure**: A self-contained unit of content, such as an image, diagram, code example, or illustration, often referenced from the main text.
+- **generic**: Content that does not convey a specific semantic role beyond grouping or containment.
+- **heading**: A label or title that introduces and describes the topic of a section.
+- **insertion**: Content that has been added or inserted into the document.
+- **list**: A collection of related items presented in a sequential or grouped form.
+- **listitem**: A single item within a list.
+- **math**: Mathematical expressions or formulas represented in a structured form.
+- **paragraph**: A distinct block of text that presents a single idea or unit of discourse.
+- **row**: A horizontal grouping of cells within a table or grid.
+- **strong**: Text of strong importance, seriousness, or urgency.
+- **subscript**: Text displayed below the baseline, typically used in mathematical or chemical notation.
+- **superscript**: Text displayed above the baseline, commonly used for exponents, references, or annotations.
+- **table**: A structured arrangement of data organized into rows and columns.
+
+
+
+
+
+## Document-level roles
+
+Document-level roles apply to parts of a document. They are the same as those defined in the [Digital Publishing WAI-ARIA Module](https://www.w3.org/TR/dpub-aria-1.1/).
+
+
+- **doc-abstract**: A short summary of the principal ideas, concepts, and conclusions of the work, or of a section or excerpt within it.
+- **doc-acknowledgments**: A section or statement that acknowledges significant contributions by persons, organizations, governments, and other entities to the realization of the work.
+- **doc-afterword**: A closing statement from the author or a person of importance, typically providing insight into how the content came to be written, its significance, or related events that have transpired since its timeline.
+- **doc-appendix**: A section of supplemental information located after the primary content that informs the content but is not central to it.
+- **doc-backlink**: A link that allows the user to return to a related location in the content (e.g., from a footnote to its reference or from a glossary definition to where a term is used).
+- **doc-bibliography**: A list of external references cited in the work, which could be to print or digital sources.
+- **doc-biblioref**: A reference to a bibliography entry.
+- **doc-chapter**: A major thematic section of content in a work.
+- **doc-colophon**: A short section of production notes particular to the edition (e.g., describing the typeface used), often located at the end of a work.
+- **doc-conclusion**: A concluding section or statement that summarizes the work or wraps up the narrative.
+- **doc-cover**: An image that sets the mood or tone for the work and typically includes the title and author.
+- **doc-credit**: An acknowledgment of the source of integrated content from third-party sources, such as photos. Typically identifies the creator, copyright, and any restrictions on reuse.
+- **doc-credits**: A collection of credits.
+- **doc-dedication**: An inscription at the front of the work, typically addressed in tribute to one or more persons close to the author.
+- **doc-endnotes**: A collection of notes at the end of a work or a section within it.
+- **doc-epigraph**: A quotation set at the start of the work or a section that establishes the theme or sets the mood.
+- **doc-epilogue**: A concluding section of narrative that wraps up or comments on the actions and events of the work, typically from a future perspective.
+- **doc-errata**: A set of corrections discovered after initial publication of the work, sometimes referred to as corrigenda.
+- **doc-example**: An illustration of a key concept of the work, such as a code listing, case study or problem.
+- **doc-footnote**: Ancillary information, such as a citation or commentary, that provides additional context to a referenced passage of text.
+- **doc-foreword**: An introductory section that precedes the work, typically not written by the author of the work.
+- **doc-glossary**: A brief dictionary of new, uncommon, or specialized terms used in the content.
+- **doc-glossref**: A reference to a glossary definition.
+- **doc-index**: A navigational aid that provides a detailed list of links to key subjects, names and other important topics covered in the work.
+- **doc-introduction**: A preliminary section that typically introduces the scope or nature of the work.
+- **doc-noteref**: A reference to a footnote or endnote, typically appearing as a superscripted number or symbol in the main body of text.
+- **doc-notice**: Notifies the user of consequences that might arise from an action or event. Examples include warnings, cautions and dangers.
+- **doc-pagebreak**: A separator denoting the position before which a break occurs between two contiguous pages in a statically paginated version of the content.
+- **doc-pagefooter**: A section of text appearing at the bottom of a page that provides context about the current work and location within it. The page footer is distinct from the body text and normally follows a repeating template that contains (possibly truncated) items such as the document title, current section, author name(s), and page number.
+- **doc-pageheader**: A section of text appearing at the top of a page that provides context about the current work and location within it. The page header is distinct from the body text and normally follows a repeating template that contains (possibly truncated) items such as the document title, current section, author name(s), and page number.
+- **doc-pagelist**: A navigational aid that provides a list of links to the page breaks in the content.
+- **doc-part**: A major structural division in a work that contains a set of related sections dealing with a particular subject, narrative arc, or similar encapsulated theme.
+- **doc-preface**: An introductory section that precedes the work, typically written by the author of the work.
+- **doc-prologue**: An introductory section that sets the background to a work, typically part of the narrative.
+- **doc-pullquote**: A distinctively placed or highlighted quotation from the current content designed to draw attention to a topic or highlight a key point.
+- **doc-qna**: A section of content structured as a series of questions and answers, such as an interview or list of frequently asked questions.
+- **doc-subtitle**: An explanatory or alternate title for the work, or a section or component within it.
+- **doc-tip**: Helpful information that clarifies some aspect of the content or assists in its comprehension.
+- **doc-toc**: A navigational aid that provides an ordered list of links to the major sectional headings in the content. A table of contents could cover an entire work or only a smaller section of it.
+
+
+Because in ARIA the title comes from the HTML document itself rather than from a role, we decided to add a non-standard role, `doc-title`, to represent the document title.
+
+
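
The category-to-role mapping described in this document can be sketched as a dictionary lookup that falls back to `generic`. This is a minimal illustration, not the project's code: `EXAMPLE_TO_ROLE` and `resolve_role` are hypothetical names, and the three entries are only a sample of what a parser might emit.

```python
from typing import Optional

# Hypothetical parser-specific mapping, for illustration only
EXAMPLE_TO_ROLE: dict[str, str] = {
    'text': 'paragraph',
    'title': 'doc-title',  # the non-standard role introduced above
    'page-footer': 'doc-pagefooter',
}


def resolve_role(category: Optional[str]) -> str:
    """Return the standardized role for a parser category.

    Missing or unmapped categories fall back to 'generic'.
    """
    if not category:
        return 'generic'
    return EXAMPLE_TO_ROLE.get(category, 'generic')
```

Keeping the fallback in one place means a parser that reports an unexpected category still yields a valid role, while the original label stays available in `category`.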
diff --git a/src/parxy_core/drivers/landingai.py b/src/parxy_core/drivers/landingai.py
@@ -18,6 +18,41 @@
 from parxy_core.models import Document, Metadata, TextBlock, Page, BoundingBox
 from parxy_core.utils import safe_json_dumps
 
+# Mapping from LandingAI ADE chunk types to WAI-ARIA document structure roles
+# See docs/explanation/document-roles.md for role definitions
+LANDINGAI_TO_ROLE: dict[str, str] = {
+    'text': 'paragraph',
+    'table': 'table',
+    'marginalia': 'generic',  # Mixed content in margins - too generic to map precisely
+    'figure': 'figure',
+    'logo': 'figure',  # DPT-2 only: logos are visual elements
+    'card': 'figure',  # DPT-2 only: ID cards, driver licenses
+    'attestation': 'figure',  # DPT-2 only: signatures, stamps, seals
+    'scan_code': 'figure',  # DPT-2 only: QR codes, barcodes
+    # Footer variants
+    'page-footer': 'doc-pagefooter',
+    'page_footer': 'doc-pagefooter',
+    'footer': 'doc-pagefooter',
+    'page-number': 'doc-pagefooter',
+    # Footnote variants
+    'footnote': 'doc-footnote',
+    'note': 'doc-footnote',
+    'endnote': 'doc-endnotes',
+    'annotation': 'doc-footnote',
+    'footer-note': 'doc-footnote',
+    # Heading variants
+    'heading': 'heading',
+    'title': 'doc-title',
+    'subtitle': 'doc-subtitle',
+    'section': 'heading',
+    'chapter': 'doc-chapter',
+    # Header variants
+    'page-header': 'doc-pageheader',
+    'page_header': 'doc-pageheader',
+    'page-heading': 'doc-pageheader',
+    'header': 'doc-pageheader',
+}
+
 
 class LandingAIADEDriver(Driver):
     def _initialize_driver(self):
@@ -134,7 +169,8 @@ def landingaiade_to_parxy(parsed_data: ParseResponse) -> Document:
             chunk_text = chunk.markdown
 
             page_text_parts.append(chunk_text)
-            chunk_type = chunk.type
+            category = chunk.type
+            role = LANDINGAI_TO_ROLE.get(category, 'generic') if category else 'generic'
 
             # Get bounding box from first grounding
             bbox = None
@@ -147,9 +183,10 @@ def landingaiade_to_parxy(parsed_data: ParseResponse) -> Document:
             # Create the appropriate block type
             block = TextBlock(
                 type='text',
+                role=role,
                 bbox=bbox,
                 page=page,
-                category=chunk_type,
+                category=category,
                 text=chunk_text,
                 source_data=chunk.model_dump(),
             )

diff --git a/src/parxy_core/drivers/llamaparse.py b/src/parxy_core/drivers/llamaparse.py
@@ -26,6 +26,41 @@
     FileNotFoundException,
 )
 
+# Mapping from LlamaParse types to WAI-ARIA document structure roles
+# See docs/explanation/document-roles.md for role definitions
+LLAMAPARSE_TO_ROLE: dict[str, str] = {
+    'text': 'paragraph',
+    'table': 'table',
+    'tables': 'table',
+    'figure': 'figure',
+    'figures': 'figure',
+    'list': 'list',
+    'lists': 'list',
+    # Footer variants
+    'footer': 'doc-pagefooter',
+    'page-footer': 'doc-pagefooter',
+    'page_footer': 'doc-pagefooter',
+    'page-number': 'doc-pagefooter',
+    # Footnote variants
+    'footnote': 'doc-footnote',
+    'note': 'doc-footnote',
+    'endnote': 'doc-endnotes',
+    'annotation': 'doc-footnote',
+    'footer-note': 'doc-footnote',
+    # Heading variants
+    'heading': 'heading',
+    'title': 'doc-title',
+    'titles': 'heading',
+    'subtitle': 'doc-subtitle',
+    'section': 'heading',
+    'chapter': 'doc-chapter',
+    # Header variants
+    'page-header': 'doc-pageheader',
+    'page_header': 'doc-pageheader',
+    'page-heading': 'doc-pageheader',
+    'header': 'doc-pageheader',
+}
+
 _credits_per_parsing_mode = {
     # Minimum credits per parsing mode as deduced from https://developers.llamaindex.ai/python/cloud/general/pricing/
     'accurate': 3,  # equivalent to Parse page with LLM as observed in their dashboard
@@ -453,9 +488,12 @@ def _convert_text_block(text_block: PageItem, page_number: int) -> TextBlock:
     text_value = text_block.value if text_block.value else ''
     if text_value == 'NO_CONTENT_HERE':
         text_value = ''
+    category = text_block.type
+    role = LLAMAPARSE_TO_ROLE.get(category, 'generic') if category else 'generic'
     return TextBlock(
         type='text',
-        category=text_block.type,
+        role=role,
+        category=category,
         level=text_block.lvl,
         text=text_value,
         bbox=bbox,

diff --git a/src/parxy_core/drivers/pdfact.py b/src/parxy_core/drivers/pdfact.py
@@ -21,6 +21,50 @@
 from parxy_core.models.config import PdfActConfig
 from parxy_core.tracing.utils import trace_with_output
 
+# Mapping from PdfAct categories to WAI-ARIA document structure roles
+# See docs/explanation/document-roles.md for role definitions
+PDFACT_TO_ROLE: dict[str, str] = {
+    'abstract': 'doc-abstract',
+    'acknowledgments': 'doc-acknowledgments',
+    'affiliation': 'generic',
+    'appendix': 'doc-appendix',
+    'authors': 'generic',
+    'body': 'paragraph',
+    'caption': 'caption',
+    'categories': 'generic',
+    'figure': 'figure',
+    'formula': 'math',
+    'general-terms': 'generic',
+    'itemize-item': 'listitem',
+    'keywords': 'generic',
+    'other': 'generic',
+    'reference': 'doc-biblioref',
+    'table': 'table',
+    'toc': 'doc-toc',
+    # Footer variants
+    'footer': 'doc-pagefooter',
+    'page-footer': 'doc-pagefooter',
+    'page_footer': 'doc-pagefooter',
+    'page-number': 'doc-pagefooter',
+    # Footnote variants
+    'footnote': 'doc-footnote',
+    'note': 'doc-footnote',
+    'endnote': 'doc-endnotes',
+    'annotation': 'doc-footnote',
+    'footer-note': 'doc-footnote',
+    # Heading variants
+    'heading': 'heading',
+    'title': 'doc-title',
+    'subtitle': 'doc-subtitle',
+    'section': 'heading',
+    'chapter': 'doc-chapter',
+    # Header variants
+    'page-header': 'doc-pageheader',
+    'page_header': 'doc-pageheader',
+    'page-heading': 'doc-pageheader',
+    'header': 'doc-pageheader',
+}
+
 
 class PdfActDriver(Driver):
     """PdfAct service driver.
@@ -287,6 +331,7 @@ def _convert_text_block(
     data = text_block.get('paragraph')
     text = data.get('text', '')
     category = data.get('role') if 'role' in data else None
+    role = PDFACT_TO_ROLE.get(category, 'generic') if category else 'generic'
     positions = data.get('positions', [])
 
     # Convert font and color
@@ -308,6 +353,7 @@ def _convert_text_block(
     bbox = _convert_bbox([pos for pos in positions if pos['page'] == page])
     return TextBlock(
         type='text',
+        role=role,
         bbox=bbox,
         style=style,
         page=page,
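
The footer, footnote, heading, and header variants appear verbatim in all three driver mappings. One possible way to avoid the repetition, sketched here under the assumption that the shared entries are meant to stay identical, is to merge a common base mapping with driver-specific entries. `COMMON_TO_ROLE` is a hypothetical name, and the PdfAct-specific entries are abbreviated.

```python
# Hypothetical shared mapping holding the variants repeated across drivers
COMMON_TO_ROLE: dict[str, str] = {
    # Footer variants
    'footer': 'doc-pagefooter',
    'page-footer': 'doc-pagefooter',
    'page_footer': 'doc-pagefooter',
    'page-number': 'doc-pagefooter',
    # Footnote variants
    'footnote': 'doc-footnote',
    'note': 'doc-footnote',
    'endnote': 'doc-endnotes',
    'annotation': 'doc-footnote',
    'footer-note': 'doc-footnote',
    # Heading variants
    'heading': 'heading',
    'title': 'doc-title',
    'subtitle': 'doc-subtitle',
    'section': 'heading',
    'chapter': 'doc-chapter',
    # Header variants
    'page-header': 'doc-pageheader',
    'page_header': 'doc-pageheader',
    'page-heading': 'doc-pageheader',
    'header': 'doc-pageheader',
}

# Driver-specific entries are listed last, so they extend the common
# ones and win if a key is ever defined in both places.
PDFACT_TO_ROLE: dict[str, str] = {
    **COMMON_TO_ROLE,
    'abstract': 'doc-abstract',
    'body': 'paragraph',
    'formula': 'math',
    'itemize-item': 'listitem',
}
```

Whether this is preferable depends on how likely the per-driver variants are to diverge. Separate dictionaries, as in the diff, keep each driver self-contained.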

diff --git a/src/parxy_core/models/models.py b/src/parxy_core/models/models.py
@@ -60,17 +60,16 @@ def isEmpty(self) -> bool:
 
 class Block(BaseModel, ABC):
     type: str
+    role: Optional[str] = 'generic'
+    """Document Structure role recognized for this block"""
     bbox: Optional[BoundingBox] = None
     page: Optional[int] = None
     source_data: Optional[dict[str, Any]] = None
+    category: Optional[str] = None
+    """Category attributed to this block by the parser"""
 
 
-class TextBlock(BaseModel):
-    type: str
-    bbox: Optional[BoundingBox] = None
-    page: Optional[int] = None
-    source_data: Optional[dict[str, Any]] = None
-    category: Optional[str] = None
+class TextBlock(Block):
     style: Optional[Style] = None
     level: Optional[int] = None
     lines: Optional[List[Line]] = None
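
The effect of moving `role` and `category` onto `Block` can be illustrated with a simplified stand-in. This sketch uses standard-library dataclasses in place of the Pydantic models and keeps only a few fields, so it shows the inheritance and the `'generic'` default rather than the actual classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    type: str
    # Standardized document structure role; defaults to 'generic'
    role: Optional[str] = 'generic'
    # Label attributed to this block by the parser
    category: Optional[str] = None


@dataclass
class TextBlock(Block):
    # Simplified: the real model has more fields (style, lines, ...)
    text: str = ''
    level: Optional[int] = None


# A block created without an explicit role inherits the default
margin_note = TextBlock(type='text', category='marginalia', text='Page 3')

# A driver that has resolved a role passes it alongside the raw category
heading = TextBlock(type='text', role='heading', category='section', text='Intro')
```

Because `TextBlock` now derives from `Block`, any block type added later would receive the same `role` and `category` fields without redeclaring them.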