HTML backend fails to properly extract content from Bootstrap accordion components and incorrectly includes hidden elements in the extracted text #1112

Open
ulan-yisaev opened this issue Mar 4, 2025 · 2 comments · May be fixed by #1115
Labels
bug Something isn't working

@ulan-yisaev

Bug

The HTML backend mishandles two kinds of markup: it drops content from Bootstrap accordion components, and it includes hidden elements in the extracted text. Specifically:

  1. Questions in accordion panels (inside <div class="panel-title"><a>...</a></div>) are skipped during extraction, so only the answers appear in the output.

  2. Content marked as hidden (through a class such as "hidden" or "d-none", the "hidden" attribute, or an inline style such as "display:none") is included in the extracted text, polluting the output with metadata and other invisible elements.

Steps to reproduce

For accordion extraction issue:

  1. Create an HTML file with Bootstrap accordion components, such as an FAQ page (example below)
  2. Process it with Docling
  3. Observe that questions in panel titles are missing from the output

```html
<div class="panel panel-default">
  <div class="panel-heading">
    <div class="panel-title">
      <a>Question text here?</a>
    </div>
  </div>
  <div class="panel-body">
    <p>Answer text here.</p>
  </div>
</div>
```
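
To make the expected result concrete, here is a minimal sketch of pairing each panel title with its panel body. It uses only Python's standard `html.parser` module and is not Docling's backend code or the fix from the pull request; `AccordionParser`, `extract_faq`, and `SAMPLE` are hypothetical names introduced for illustration, and the sketch assumes well-formed markup with matching `<div>` tags.

```python
from html.parser import HTMLParser

# The accordion markup from the reproduction steps.
SAMPLE = """
<div class="panel panel-default">
  <div class="panel-heading">
    <div class="panel-title">
      <a>Question text here?</a>
    </div>
  </div>
  <div class="panel-body">
    <p>Answer text here.</p>
  </div>
</div>
"""


class AccordionParser(HTMLParser):
    """Collect (question, answer) pairs from panel-title / panel-body divs."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (question, answer) tuples
        self._stack = []     # role of each open <div>: "title", "body" or None
        self._question = []  # text fragments seen inside the current title
        self._answer = []    # text fragments seen inside the current body

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "panel-title" in classes:
            self._stack.append("title")
        elif "panel-body" in classes:
            self._stack.append("body")
        else:
            self._stack.append(None)

    def handle_endtag(self, tag):
        if tag != "div" or not self._stack:
            return
        if self._stack.pop() == "body":
            # Closing a panel body completes one question/answer pair.
            question = " ".join("".join(self._question).split())
            answer = " ".join("".join(self._answer).split())
            self.pairs.append((question, answer))
            self._question, self._answer = [], []

    def handle_data(self, data):
        if "title" in self._stack:
            self._question.append(data)
        elif "body" in self._stack:
            self._answer.append(data)


def extract_faq(html):
    """Return the (question, answer) pairs found in the given HTML string."""
    parser = AccordionParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


print(extract_faq(SAMPLE))
```

With the sample above, the question from the panel title should appear next to its answer, which is the behaviour the report says is missing from the current output.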

For hidden content issue:

  1. Create an HTML file with hidden elements (example below)
  2. Process it with Docling
  3. Observe that hidden content is included in the output

```html
<div class="container">
  <p>Visible content</p>
  <div class="hidden">Hidden metadata that should be skipped</div>
</div>
```
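
The hidden-content check described in this report can be sketched as a small predicate plus a parser that ignores everything nested inside a hidden element. This is an illustration built on Python's standard `html.parser`, not Docling's implementation; `is_hidden`, `VisibleTextParser`, `visible_text`, and `HIDDEN_CLASSES` are hypothetical names, the class list covers only the two classes mentioned above, and the depth counter assumes well-formed HTML with matching tags.

```python
from html.parser import HTMLParser

# Classes named in the report; a real fix might need a longer list.
HIDDEN_CLASSES = {"hidden", "d-none"}

# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def is_hidden(attrs):
    """Return True if the attribute list marks an element as hidden."""
    attrs = dict(attrs)
    if "hidden" in attrs:  # boolean HTML attribute: <div hidden>
        return True
    if set((attrs.get("class") or "").split()) & HIDDEN_CLASSES:
        return True
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return "display:none" in style


class VisibleTextParser(HTMLParser):
    """Collect text that is not nested inside a hidden element."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._hidden_depth = 0  # > 0 while inside a hidden subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._hidden_depth or is_hidden(attrs):
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._hidden_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible text fragments of an HTML string, in order."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


SAMPLE = """
<div class="container">
  <p>Visible content</p>
  <div class="hidden">Hidden metadata that should be skipped</div>
</div>
"""

print(visible_text(SAMPLE))
```

Counting depth rather than checking each element on its own means that children of a hidden container are skipped too, even when they carry no hiding class themselves.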

Docling version

docling 2.25.1

Python version

Python 3.12

I've implemented a fix for both issues that:

  1. Adds specialized handlers for accordion components
  2. Implements proper detection and filtering of hidden elements
  3. Includes comprehensive tests

I'd be happy to submit a pull request with these changes if this bug report is confirmed.

ulan-yisaev added the bug label on Mar 4, 2025

@PeterStaar-IBM
Contributor

@ulan-yisaev I agree, we need to expand the HTML parser. Happy to get your PR!

@ulan-yisaev
Author

@PeterStaar-IBM Thank you for your quick response! I've submitted a PR: #1115
