HTML backend fails to properly extract content from Bootstrap accordion components and incorrectly includes hidden elements in the extracted text
#1112
Open
ulan-yisaev opened this issue
Mar 4, 2025
· 2 comments
· May be fixed by #1115
The HTML backend fails to properly extract content from Bootstrap accordion components and incorrectly includes hidden elements in the extracted text. Specifically:
Questions in accordion panels (inside <div class="panel-title"><a>...</a></div>) are skipped during extraction, so the output contains only the answers.
Content marked as hidden (via classes such as "hidden" or "d-none", the "hidden" attribute, or inline styles such as "display:none") is included in the extracted text, polluting the output with metadata and invisible elements.
Steps to reproduce
For accordion extraction issue:
Create an HTML file with Bootstrap accordion components (example below, taken from an FAQ page)
Process it with Docling
Observe that questions in panel titles are missing from the output
<div class="panel panel-default">
  <div class="panel-heading">
    <div class="panel-title"><a>Question text here?</a></div>
  </div>
  <div class="panel-body"><p>Answer text here.</p></div>
</div>
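To make the expected result concrete, here is a minimal sketch of pairing each panel title with its panel body. It uses only Python's standard-library `html.parser` and is not Docling's HTML backend or the proposed fix; the names `AccordionParser`, `extract_faq`, and `ACCORDION_SAMPLE` are hypothetical, and the markup is assumed to be well-formed Bootstrap 3 panels like the one above.

```python
from html.parser import HTMLParser


class AccordionParser(HTMLParser):
    """Collect (question, answer) pairs from Bootstrap 3 panel markup.

    Hypothetical helper for illustration; assumes every <div> is closed.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (question, answer) tuples
        self._stack = []     # role of each open <div>: "title", "body" or None
        self._question = []  # text fragments seen inside panel-title
        self._answer = []    # text fragments seen inside panel-body

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "panel-title" in classes:
            self._stack.append("title")
        elif "panel-body" in classes:
            self._stack.append("body")
        else:
            self._stack.append(None)

    def handle_endtag(self, tag):
        if tag != "div" or not self._stack:
            return
        role = self._stack.pop()
        if role == "body":
            # Closing a panel body completes one question/answer pair.
            self.pairs.append(
                (" ".join(self._question), " ".join(self._answer))
            )
            self._question, self._answer = [], []

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "title" in self._stack:
            self._question.append(text)
        elif "body" in self._stack:
            self._answer.append(text)


def extract_faq(html):
    """Return a list of (question, answer) tuples found in html."""
    parser = AccordionParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


ACCORDION_SAMPLE = (
    '<div class="panel panel-default">'
    '<div class="panel-heading">'
    '<div class="panel-title"><a>Question text here?</a></div>'
    '</div>'
    '<div class="panel-body"><p>Answer text here.</p></div>'
    '</div>'
)
```

Calling `extract_faq(ACCORDION_SAMPLE)` yields one pair containing both the question and the answer, which is what the extracted document should preserve.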
For hidden content issue:
Create an HTML file with hidden elements (example below)
Process it with Docling
Observe that hidden content is included in the output
<div class="container">
  <p>Visible content</p>
  <div class="hidden">Hidden metadata that should be skipped</div>
</div>
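The hidden-element check described in this report can be sketched as follows. This is an illustration built on the standard-library `html.parser`, not Docling's code or the proposed patch; `is_hidden`, `VisibleTextParser`, `visible_text`, and `HIDDEN_SAMPLE` are hypothetical names, and the set of hidden classes is limited to the two mentioned above. The depth counter assumes non-void elements have explicit closing tags.

```python
from html.parser import HTMLParser

# Assumption: only the classes named in this report are treated as hidden.
HIDDEN_CLASSES = {"hidden", "d-none"}

# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


def is_hidden(attrs):
    """Return True if a tag's (name, value) attribute pairs mark it hidden."""
    attrs = dict(attrs)
    if "hidden" in attrs:  # boolean attribute: <div hidden>
        return True
    classes = set((attrs.get("class") or "").split())
    if classes & HIDDEN_CLASSES:
        return True
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return "display:none" in style


class VisibleTextParser(HTMLParser):
    """Collect text that is not inside a hidden subtree."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._hidden_depth = 0  # > 0 while inside a hidden element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a hidden element, every nested tag deepens the count.
        if self._hidden_depth or is_hidden(attrs):
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._hidden_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return visible text chunks joined by newlines."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


HIDDEN_SAMPLE = (
    '<div class="container"><p>Visible content</p>'
    '<div class="hidden">Hidden metadata that should be skipped</div></div>'
)
```

With `HIDDEN_SAMPLE`, `visible_text` returns only the visible paragraph. Elements hidden through external stylesheets are out of scope for a check like this, since it inspects only attributes on the tag itself.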
Docling version
docling 2.25.1
Python version
Python 3.12
I've implemented a fix for both issues that:
Adds specialized handlers for accordion components
Implements proper detection and filtering of hidden elements
Includes comprehensive tests
I'd be happy to submit a pull request with these changes if this bug report is confirmed.