
fix: skip complexity check when strategy is explicitly fast #4280

Open
AlonNaor22 wants to merge 1 commit into Unstructured-IO:main from AlonNaor22:fix/pdf-fast-strategy-empty-result


@AlonNaor22

Summary

  • Fixes #4260 (bug/partition_pdf() with stategy = "fast" is not extracting any elements): partition_pdf(strategy="fast") returns an empty list for PDFs flagged as "too complex".
  • Root cause: the is_pdf_too_complex() check added in #4268 (Add a check for complex pdfs) skips text extraction when a PDF is complex. When strategy="fast" is set explicitly, the strategy is not changed, so _partition_pdf_with_pdfparser() receives empty extracted_elements and returns [].
  • Fix: bypass the is_pdf_too_complex() check entirely when strategy="fast", since the user explicitly requested text-based extraction. The complexity check is only relevant for strategy="auto", where it decides between fast and hi_res.
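For reviewers who want the control flow at a glance, here is a minimal, self-contained sketch of the bug and the guard. It is not the code in unstructured/partition/pdf.py. The `PartitionStrategy` enum, the length-based complexity heuristic, `extract_text_elements`, `partition_sketch`, and the `apply_fix` switch are all hypothetical stand-ins that only mirror the behavior described above.

```python
from enum import Enum


class PartitionStrategy(str, Enum):
    # Stand-in enum; the real constants live in the library.
    AUTO = "auto"
    FAST = "fast"
    HI_RES = "hi_res"


def is_pdf_too_complex(pdf_text: str) -> bool:
    # Hypothetical heuristic, for illustration only.
    return len(pdf_text) > 20


def extract_text_elements(pdf_text: str) -> list:
    # Stand-in for text-based extraction: one element per non-empty line.
    return [line for line in pdf_text.splitlines() if line.strip()]


def partition_sketch(pdf_text: str, strategy: PartitionStrategy, apply_fix: bool = True) -> list:
    # With the fix, an explicit FAST strategy never runs the complexity check.
    skip_check = apply_fix and strategy == PartitionStrategy.FAST
    too_complex = (not skip_check) and is_pdf_too_complex(pdf_text)

    # Text extraction is skipped for PDFs judged too complex.
    extracted = [] if too_complex else extract_text_elements(pdf_text)

    # Only AUTO uses the complexity result to pick a strategy.
    if strategy == PartitionStrategy.AUTO:
        strategy = PartitionStrategy.HI_RES if too_complex else PartitionStrategy.FAST

    if strategy == PartitionStrategy.FAST:
        # Before the fix, this could be an empty list for complex PDFs.
        return extracted
    return ["<hi_res placeholder>"]


complex_pdf = "line one\nline two\nline three"
before = partition_sketch(complex_pdf, PartitionStrategy.FAST, apply_fix=False)
after = partition_sketch(complex_pdf, PartitionStrategy.FAST)
```

In this model, `before` is empty because extraction was skipped while the strategy stayed fast, and `after` contains the three text lines.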

Files changed

  • unstructured/partition/pdf.py — added strategy != PartitionStrategy.FAST guard before calling is_pdf_too_complex()
  • test_unstructured/partition/pdf_image/test_pdf.py — added test verifying fast strategy bypasses complexity check and returns non-empty results

Test plan

  • New test: test_partition_pdf_fast_strategy_bypasses_complexity_check — mocks is_pdf_too_complex to return True, verifies it's never called when strategy is fast, verifies elements are extracted
  • Existing fast strategy tests pass (test_partition_pdf_with_fast_strategy, test_partition_pdf_with_fast_strategy_from_file, test_partition_pdf_with_fast_groups_text)
  • Existing complexity check tests pass
  • Linter and formatter clean
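The shape of the new test can be sketched as follows. This is an illustrative stand-in rather than the test added in test_pdf.py: `pdf_module`, `partition_pdf_sketch`, and the file name `example.pdf` are hypothetical, and the real test patches the library's own `is_pdf_too_complex`. The only library calls used here are from the standard `unittest.mock` module.

```python
from types import SimpleNamespace
from unittest import mock


def _is_pdf_too_complex(filename: str) -> bool:
    # Default stand-in: nothing is considered complex.
    return False


# Hypothetical namespace playing the role of the module under test.
pdf_module = SimpleNamespace(is_pdf_too_complex=_is_pdf_too_complex)


def partition_pdf_sketch(filename: str, strategy: str = "auto") -> list:
    # The guard: short-circuit so the check is never called for "fast".
    too_complex = strategy != "fast" and pdf_module.is_pdf_too_complex(filename)
    if too_complex:
        return ["<hi_res placeholder>"]
    return [f"text from {filename}"]


def test_fast_strategy_bypasses_complexity_check():
    # Force the check to report "too complex" if it were ever consulted.
    with mock.patch.object(pdf_module, "is_pdf_too_complex", return_value=True) as check:
        elements = partition_pdf_sketch("example.pdf", strategy="fast")
    check.assert_not_called()
    assert len(elements) > 0


def test_auto_strategy_still_runs_complexity_check():
    with mock.patch.object(pdf_module, "is_pdf_too_complex", return_value=True) as check:
        elements = partition_pdf_sketch("example.pdf", strategy="auto")
    check.assert_called_once_with("example.pdf")
    assert elements == ["<hi_res placeholder>"]
```

Asserting `assert_not_called()` in addition to the non-empty result pins down the mechanism (the check is bypassed), not just the outcome.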

The is_pdf_too_complex() check introduced in Unstructured-IO#4268 skips text extraction
for complex PDFs. But when strategy="fast" is explicitly passed, this
leaves extracted_elements empty, causing _partition_pdf_with_pdfparser()
to return an empty list. Now the complexity check is bypassed when the
user explicitly requests the fast strategy.

Closes Unstructured-IO#4260

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
