diff --git a/docs/explanation/document-roles.md b/docs/explanation/document-roles.md new file mode 100644 index 0000000..8737f38 --- /dev/null +++ b/docs/explanation/document-roles.md @@ -0,0 +1,94 @@ +# Document Structure + +Documents and pages are composed of logical parts, e.g. headings, footnotes, tables, images, headers, and footers, that describe *what content is*, rather than *how it is visually presented*. These logical parts are independent of layout, styling, or rendering medium. + +We refer to this logical classification as **document structure**, or more generally as **document semantics**. In this context, *semantics* describe the role a piece of content plays within the document, as defined by the [W3C Web Accessibility Initiative definition of semantics](https://www.w3.org/TR/wai-aria/#dfn-semantics). + +Following the terminology of the [W3C Web Accessibility Initiative (WAI)](https://www.w3.org/WAI/standards-guidelines/aria/), each logical part of a document is assigned a **role**. +Roles prefixed with `doc-` identify **document structure roles**, while other roles describe **in-page or interactive structures**. + +Document structure roles describe elements that organize and contextualize content within a page or publication. These roles are typically **non-interactive** and convey structural meaning rather than behavior. + +**Roles** are borrowed from [Accessible Rich Internet Applications (WAI-ARIA)](https://www.w3.org/TR/wai-aria/) and the [Digital Publishing WAI-ARIA Module](https://www.w3.org/TR/dpub-aria-1.1/), the latter being specifically suited to structural divisions of long-form documents. + +Because many document parsers and extraction tools use inconsistent or undocumented classification schemes, we maintain explicit **mappings from parser-specific categories to WAI-ARIA roles**. For this reason, we use the `role` field (rather than `category`, which keeps the label assigned by the parser) to represent document structure semantics in a standardized way. 
Note that a given parser might not support every role. + + +## Page level roles + +Page level roles apply to parts of a page (although some of them can span multiple pages). Page roles are a subset of [WAI-ARIA Document Structure Roles](https://www.w3.org/TR/wai-aria/#document_structure_roles)[^1]. + +[^1]: The term *document structure* can be confusing here: ARIA roles come from accessibility work on HTML documents, so *document* in that context refers to an element of the markup language containing content that assistive technology users may want to browse in a reading mode. + +- **blockquote**: A section of content quoted from another source. +- **caption**: Visible content that names and may describe a figure or a table. +- **definition**: Content that provides the meaning or explanation of a term. +- **deletion**: Content that has been removed or marked as deleted from the document. +- **emphasis**: Text that is emphasized to convey stress or importance in spoken or written language. +- **figure**: A self-contained unit of content, such as an image, diagram, code example, or illustration, often referenced from the main text. +- **generic**: Content that does not convey a specific semantic role beyond grouping or containment. +- **heading**: A label or title that introduces and describes the topic of a section. +- **insertion**: Content that has been added or inserted into the document. +- **list**: A collection of related items presented in a sequential or grouped form. +- **listitem**: A single item within a list. +- **math**: Mathematical expressions or formulas represented in a structured form. +- **paragraph**: A distinct block of text that presents a single idea or unit of discourse. +- **row**: A horizontal grouping of cells within a table or grid. +- **strong**: Text of strong importance, seriousness, or urgency. +- **subscript**: Text displayed below the baseline, typically used in mathematical or chemical notation. 
+- **superscript**: Text displayed above the baseline, commonly used for exponents, references, or annotations. +- **table**: A structured arrangement of data organized into rows and columns. + + + + + +## Document level roles + +Document level roles apply to parts of a document. Document roles are the same as those defined in the [Digital Publishing WAI-ARIA Module](https://www.w3.org/TR/dpub-aria-1.1/). + + +- **doc-abstract**: A short summary of the principal ideas, concepts, and conclusions of the work, or of a section or excerpt within it. +- **doc-acknowledgments**: A section or statement that acknowledges significant contributions by persons, organizations, governments, and other entities to the realization of the work. +- **doc-afterword**: A closing statement from the author or a person of importance, typically providing insight into how the content came to be written, its significance, or related events that have transpired since its timeline. +- **doc-appendix**: A section of supplemental information located after the primary content that informs the content but is not central to it. +- **doc-backlink**: A link that allows the user to return to a related location in the content (e.g., from a footnote to its reference or from a glossary definition to where a term is used). +- **doc-bibliography**: A list of external references cited in the work, which could be to print or digital sources. +- **doc-biblioref**: A reference to a bibliography entry. +- **doc-chapter**: A major thematic section of content in a work. +- **doc-colophon**: A short section of production notes particular to the edition (e.g., describing the typeface used), often located at the end of a work. +- **doc-conclusion**: A concluding section or statement that summarizes the work or wraps up the narrative. +- **doc-cover**: An image that sets the mood or tone for the work and typically includes the title and author. 
+- **doc-credit**: An acknowledgment of the source of integrated content from third-party sources, such as photos. Typically identifies the creator, copyright, and any restrictions on reuse. +- **doc-credits**: A collection of credits. +- **doc-dedication**: An inscription at the front of the work, typically addressed in tribute to one or more persons close to the author. +- **doc-endnotes**: A collection of notes at the end of a work or a section within it. +- **doc-epigraph**: A quotation set at the start of the work or a section that establishes the theme or sets the mood. +- **doc-epilogue**: A concluding section of narrative that wraps up or comments on the actions and events of the work, typically from a future perspective. +- **doc-errata**: A set of corrections discovered after initial publication of the work, sometimes referred to as corrigenda. +- **doc-example**: An illustration of a key concept of the work, such as a code listing, case study or problem. +- **doc-footnote**: Ancillary information, such as a citation or commentary, that provides additional context to a referenced passage of text. +- **doc-foreword**: An introductory section that precedes the work, typically not written by the author of the work. +- **doc-glossary**: A brief dictionary of new, uncommon, or specialized terms used in the content. +- **doc-glossref**: A reference to a glossary definition. +- **doc-index**: A navigational aid that provides a detailed list of links to key subjects, names and other important topics covered in the work. +- **doc-introduction**: A preliminary section that typically introduces the scope or nature of the work. +- **doc-noteref**: A reference to a footnote or endnote, typically appearing as a superscripted number or symbol in the main body of text. +- **doc-notice**: Notifies the user of consequences that might arise from an action or event. Examples include warnings, cautions and dangers. 
+- **doc-pagebreak**: A separator denoting the position before which a break occurs between two contiguous pages in a statically paginated version of the content. +- **doc-pagefooter**: A section of text appearing at the bottom of a page that provides context about the current work and location within it. The page footer is distinct from the body text and normally follows a repeating template that contains (possibly truncated) items such as the document title, current section, author name(s), and page number. +- **doc-pageheader**: A section of text appearing at the top of a page that provides context about the current work and location within it. The page header is distinct from the body text and normally follows a repeating template that contains (possibly truncated) items such as the document title, current section, author name(s), and page number. +- **doc-pagelist**: A navigational aid that provides a list of links to the page breaks in the content. +- **doc-part**: A major structural division in a work that contains a set of related sections dealing with a particular subject, narrative arc, or similar encapsulated theme. +- **doc-preface**: An introductory section that precedes the work, typically written by the author of the work. +- **doc-prologue**: An introductory section that sets the background to a work, typically part of the narrative. +- **doc-pullquote**: A distinctively placed or highlighted quotation from the current content designed to draw attention to a topic or highlight a key point. +- **doc-qna**: A section of content structured as a series of questions and answers, such as an interview or list of frequently asked questions. +- **doc-subtitle**: An explanatory or alternate title for the work, or a section or component within it. +- **doc-tip**: Helpful information that clarifies some aspect of the content or assists in its comprehension. 
+- **doc-toc**: A navigational aid that provides an ordered list of links to the major sectional headings in the content. A table of contents could cover an entire work or only a smaller section of it. + + +In ARIA, the title comes from the HTML document itself rather than from a role, so we added a non-standard role, `doc-title`, to represent the document title. + + diff --git a/src/parxy_core/drivers/landingai.py b/src/parxy_core/drivers/landingai.py index 9c689af..bc624ac 100644 --- a/src/parxy_core/drivers/landingai.py +++ b/src/parxy_core/drivers/landingai.py @@ -18,6 +18,41 @@ from parxy_core.models import Document, Metadata, TextBlock, Page, BoundingBox from parxy_core.utils import safe_json_dumps +# Mapping from LandingAI ADE chunk types to WAI-ARIA document structure roles +# See docs/explanation/document-roles.md for role definitions +LANDINGAI_TO_ROLE: dict[str, str] = { + 'text': 'paragraph', + 'table': 'table', + 'marginalia': 'generic', # Mixed content in margins - too generic to map precisely + 'figure': 'figure', + 'logo': 'figure', # DPT-2 only: logos are visual elements + 'card': 'figure', # DPT-2 only: ID cards, driver licenses + 'attestation': 'figure', # DPT-2 only: signatures, stamps, seals + 'scan_code': 'figure', # DPT-2 only: QR codes, barcodes + # Footer variants + 'page-footer': 'doc-pagefooter', + 'page_footer': 'doc-pagefooter', + 'footer': 'doc-pagefooter', + 'page-number': 'doc-pagefooter', + # Footnote variants + 'footnote': 'doc-footnote', + 'note': 'doc-footnote', + 'endnote': 'doc-endnotes', + 'annotation': 'doc-footnote', + 'footer-note': 'doc-footnote', + # Heading variants + 'heading': 'heading', + 'title': 'doc-title', + 'subtitle': 'doc-subtitle', + 'section': 'heading', + 'chapter': 'doc-chapter', + # Header variants + 'page-header': 'doc-pageheader', + 'page_header': 'doc-pageheader', + 'page-heading': 'doc-pageheader', + 'header': 'doc-pageheader', +} + class LandingAIADEDriver(Driver): def _initialize_driver(self): @@ -134,7 +169,8 @@ 
def landingaiade_to_parxy(parsed_data: ParseResponse) -> Document: chunk_text = chunk.markdown page_text_parts.append(chunk_text) - chunk_type = chunk.type + category = chunk.type + role = LANDINGAI_TO_ROLE.get(category, 'generic') if category else 'generic' # Get bounding box from first grounding bbox = None @@ -147,9 +183,10 @@ def landingaiade_to_parxy(parsed_data: ParseResponse) -> Document: # Create the appropriate block type block = TextBlock( type='text', + role=role, bbox=bbox, page=page, - category=chunk_type, + category=category, text=chunk_text, source_data=chunk.model_dump(), ) diff --git a/src/parxy_core/drivers/llamaparse.py b/src/parxy_core/drivers/llamaparse.py index 426f53b..fb6dc35 100644 --- a/src/parxy_core/drivers/llamaparse.py +++ b/src/parxy_core/drivers/llamaparse.py @@ -26,6 +26,41 @@ FileNotFoundException, ) +# Mapping from LlamaParse types to WAI-ARIA document structure roles +# See docs/explanation/document-roles.md for role definitions +LLAMAPARSE_TO_ROLE: dict[str, str] = { + 'text': 'paragraph', + 'table': 'table', + 'tables': 'table', + 'figure': 'figure', + 'figures': 'figure', + 'list': 'list', + 'lists': 'list', + # Footer variants + 'footer': 'doc-pagefooter', + 'page-footer': 'doc-pagefooter', + 'page_footer': 'doc-pagefooter', + 'page-number': 'doc-pagefooter', + # Footnote variants + 'footnote': 'doc-footnote', + 'note': 'doc-footnote', + 'endnote': 'doc-endnotes', + 'annotation': 'doc-footnote', + 'footer-note': 'doc-footnote', + # Heading variants + 'heading': 'heading', + 'title': 'doc-title', + 'titles': 'heading', + 'subtitle': 'doc-subtitle', + 'section': 'heading', + 'chapter': 'doc-chapter', + # Header variants + 'page-header': 'doc-pageheader', + 'page_header': 'doc-pageheader', + 'page-heading': 'doc-pageheader', + 'header': 'doc-pageheader', +} + _credits_per_parsing_mode = { # Minimum credits per parsing mode as deduced from https://developers.llamaindex.ai/python/cloud/general/pricing/ 'accurate': 3, # equivalent 
to Parse page with LLM as observed in their dashboard @@ -453,9 +488,12 @@ def _convert_text_block(text_block: PageItem, page_number: int) -> TextBlock: text_value = text_block.value if text_block.value else '' if text_value == 'NO_CONTENT_HERE': text_value = '' + category = text_block.type + role = LLAMAPARSE_TO_ROLE.get(category, 'generic') if category else 'generic' return TextBlock( type='text', - category=text_block.type, + role=role, + category=category, level=text_block.lvl, text=text_value, bbox=bbox, diff --git a/src/parxy_core/drivers/pdfact.py b/src/parxy_core/drivers/pdfact.py index 30e8f2d..80c5fde 100644 --- a/src/parxy_core/drivers/pdfact.py +++ b/src/parxy_core/drivers/pdfact.py @@ -21,6 +21,50 @@ from parxy_core.models.config import PdfActConfig from parxy_core.tracing.utils import trace_with_output +# Mapping from PdfAct categories to WAI-ARIA document structure roles +# See docs/explanation/document-roles.md for role definitions +PDFACT_TO_ROLE: dict[str, str] = { + 'abstract': 'doc-abstract', + 'acknowledgments': 'doc-acknowledgments', + 'affiliation': 'generic', + 'appendix': 'doc-appendix', + 'authors': 'generic', + 'body': 'paragraph', + 'caption': 'caption', + 'categories': 'generic', + 'figure': 'figure', + 'formula': 'math', + 'general-terms': 'generic', + 'itemize-item': 'listitem', + 'keywords': 'generic', + 'other': 'generic', + 'reference': 'doc-biblioref', + 'table': 'table', + 'toc': 'doc-toc', + # Footer variants + 'footer': 'doc-pagefooter', + 'page-footer': 'doc-pagefooter', + 'page_footer': 'doc-pagefooter', + 'page-number': 'doc-pagefooter', + # Footnote variants + 'footnote': 'doc-footnote', + 'note': 'doc-footnote', + 'endnote': 'doc-endnotes', + 'annotation': 'doc-footnote', + 'footer-note': 'doc-footnote', + # Heading variants + 'heading': 'heading', + 'title': 'doc-title', + 'subtitle': 'doc-subtitle', + 'section': 'heading', + 'chapter': 'doc-chapter', + # Header variants + 'page-header': 'doc-pageheader', + 'page_header': 
'doc-pageheader', + 'page-heading': 'doc-pageheader', + 'header': 'doc-pageheader', +} + class PdfActDriver(Driver): """PdfAct service driver. @@ -287,6 +331,7 @@ def _convert_text_block( data = text_block.get('paragraph') text = data.get('text', '') category = data.get('role') if 'role' in data else None + role = PDFACT_TO_ROLE.get(category, 'generic') if category else 'generic' positions = data.get('positions', []) # Convert font and color @@ -308,6 +353,7 @@ def _convert_text_block( bbox = _convert_bbox([pos for pos in positions if pos['page'] == page]) return TextBlock( type='text', + role=role, bbox=bbox, style=style, page=page, diff --git a/src/parxy_core/models/models.py b/src/parxy_core/models/models.py index ad3f86c..07bfab3 100644 --- a/src/parxy_core/models/models.py +++ b/src/parxy_core/models/models.py @@ -60,17 +60,16 @@ def isEmpty(self) -> bool: class Block(BaseModel, ABC): type: str + role: Optional[str] = 'generic' + """Document Structure role recognized for this block""" bbox: Optional[BoundingBox] = None page: Optional[int] = None source_data: Optional[dict[str, Any]] = None + category: Optional[str] = None + """Category attributed to this block by the parser""" -class TextBlock(BaseModel): - type: str - bbox: Optional[BoundingBox] = None - page: Optional[int] = None - source_data: Optional[dict[str, Any]] = None - category: Optional[str] = None +class TextBlock(Block): style: Optional[Style] = None level: Optional[int] = None lines: Optional[List[Line]] = None
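
The three drivers in this diff share one lookup pattern: the parser's label is kept as `category`, and `role` is resolved through a per-parser mapping that falls back to `generic` when the label is missing or unknown. The snippet below is a minimal sketch of that pattern in isolation, not the project's code. `EXAMPLE_TO_ROLE`, `resolve_role`, `SketchBlock`, and `make_block` are hypothetical names introduced for illustration, and a plain dataclass stands in for the pydantic `TextBlock` so the snippet runs without dependencies.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, trimmed-down mapping for illustration only; the per-driver
# tables in the diff (e.g. PDFACT_TO_ROLE) contain many more entries.
EXAMPLE_TO_ROLE: dict[str, str] = {
    'text': 'paragraph',
    'title': 'doc-title',
    'page-footer': 'doc-pagefooter',
    'formula': 'math',
}


def resolve_role(category: Optional[str], mapping: dict[str, str]) -> str:
    """Return the mapped role, or 'generic' for a missing or unknown category."""
    if not category:
        # None and the empty string both fall back to 'generic'.
        return 'generic'
    return mapping.get(category, 'generic')


@dataclass
class SketchBlock:
    """Stand-in for TextBlock (assumption: a dataclass, not the pydantic model)."""

    text: str
    category: Optional[str] = None  # label as reported by the parser
    role: str = 'generic'           # standardized document structure role


def make_block(text: str, category: Optional[str]) -> SketchBlock:
    """Build a block that keeps the raw category and adds the resolved role."""
    return SketchBlock(
        text=text,
        category=category,
        role=resolve_role(category, EXAMPLE_TO_ROLE),
    )


block = make_block('E = mc^2', 'formula')
print(block.category, block.role)  # formula math
```

Keeping `category` next to `role` means the parser's original label is still available for debugging or for refining a mapping later, while downstream code can rely on `role` alone.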