fix: preserve row boundaries in Table text representation #4279

Open
AlonNaor22 wants to merge 1 commit into Unstructured-IO:main from
AlonNaor22:fix/html-table-text-row-boundaries
Conversation

@AlonNaor22

Summary

  • Fixes #4235 (bug/XLSX Table string representation loses row boundaries after 0.16.0 causing flattened output from partition_xlsx): partition_xlsx (and CSV, TSV, PPTX) Table elements lost row boundaries in str(Table) after v0.16.1
  • Root cause: HtmlTable.text used " ".join(self._table.itertext()) which flattened all cell text across all rows into a single space-separated string, destroying row structure
  • Fix: HtmlTable.text now joins cells within a row with spaces and joins rows with newlines, restoring the ability to split table text by \n to get logical rows
  • This was a regression from commit c85f29e which replaced soupparser_fromstring().text_content() (which preserved newlines) with the flattening itertext() approach
  • Chunking code is unaffected — it already does its own text extraction independently via iter_cell_texts()
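
To make the difference between the two joining strategies concrete, here is a minimal sketch. It is not the actual HtmlTable implementation: it uses the standard library's xml.etree.ElementTree as a stand-in parser, and the helper names flattened_text and row_preserving_text, as well as the sample table, are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample table, used only for this illustration.
HTML = (
    "<table>"
    "<tr><td>Name</td><td>Age</td></tr>"
    "<tr><td>Ada</td><td>36</td></tr>"
    "</table>"
)


def flattened_text(table):
    """Join every text node with spaces (the behavior described as the root cause)."""
    return " ".join(t.strip() for t in table.itertext() if t.strip())


def row_preserving_text(table):
    """Join cells within a row with spaces and rows with newlines (the described fix)."""
    rows = []
    for tr in table.iter("tr"):
        # Collect the text of each direct <td>/<th> child of this row.
        cells = [
            "".join(cell.itertext()).strip()
            for cell in tr
            if cell.tag in ("td", "th")
        ]
        row_text = " ".join(c for c in cells if c)
        if row_text:  # skip rows that contain no text at all
            rows.append(row_text)
    return "\n".join(rows)


table = ET.fromstring(HTML)
flat = flattened_text(table)
rowwise = row_preserving_text(table)

print(repr(flat))
print(repr(rowwise))
print(rowwise.split("\n"))
```

With the flattened form there is no way to tell where one row ends and the next begins, while the row-preserving form can be split on \n to recover one string per logical row.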

Files changed

  • unstructured/common/html_table.py: HtmlTable.text property now iterates rows via iter_rows() and joins them with \n instead of flattening all text with spaces
  • test_unstructured/common/test_html_table.py — updated expected text to use newline-separated rows
  • test_unstructured/partition/test_constants.py — updated EXPECTED_TEXT* constants to reflect newline-separated row format
  • test_unstructured/partition/test_xlsx.py — updated direct .text == assertions for subtable and find_subtable=False tests
  • test_unstructured/partition/test_csv.py — removed unnecessary clean_extra_whitespace wrappers and unused import
  • test_unstructured/partition/test_tsv.py — updated header test assertion

Test plan

  • All 74 XLSX tests pass
  • All 24 html_table unit tests pass
  • All CSV tests pass (1 pre-existing failure in test_partition_csv_with_encoding unrelated to this change)
  • All 18 TSV tests pass
  • Linter and formatter clean (ruff check + ruff format)

HtmlTable.text was joining all cell text with spaces, destroying row
boundaries. This was a regression from v0.16.1 (commit c85f29e) when
soupparser_fromstring().text_content() was replaced with itertext()-based
flattening. Now cells within a row are space-separated and rows are
newline-separated, restoring the ability to reconstruct row structure
from str(Table).

Closes Unstructured-IO#4235

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>