fix: preserve row boundaries in Table text representation #4279
Open
AlonNaor22 wants to merge 1 commit into Unstructured-IO:main from
`HtmlTable.text` was joining all cell text with spaces, destroying row boundaries. This was a regression from v0.16.1 (commit c85f29e) when `soupparser_fromstring().text_content()` was replaced with `itertext()`-based flattening. Now cells within a row are space-separated and rows are newline-separated, restoring the ability to reconstruct row structure from `str(Table)`.

Closes Unstructured-IO#4235

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
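As a rough illustration of what "reconstruct row structure" means for a caller, here is a minimal sketch that splits a newline-separated table string back into logical rows. The `table_text` value and the `split_rows` helper are hypothetical and exist only for this example; they are not part of the library.

```python
# Hypothetical value of str(table) under the newline-separated format.
table_text = "Name Qty\nApple 3\nPear 5"


def split_rows(text: str) -> list[str]:
    """Return one string per logical table row, skipping empty lines."""
    return [line for line in text.split("\n") if line]


rows = split_rows(table_text)
for row in rows:
    print(row)
```

Note that only row boundaries are recoverable this way. Cells are still separated by plain spaces, so a cell whose text contains a space cannot be told apart from two adjacent cells.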
Summary
- `partition_xlsx` (and CSV, TSV, PPTX) `Table` elements lost row boundaries in `str(Table)` after v0.16.1
- `HtmlTable.text` used `" ".join(self._table.itertext())`, which flattened all cell text across all rows into a single space-separated string, destroying row structure
- `HtmlTable.text` now joins cells within a row with spaces and joins rows with newlines, restoring the ability to split table text by `\n` to get logical rows
- Regression from v0.16.1 (commit c85f29e), which replaced `soupparser_fromstring().text_content()` (which preserved newlines) with the flattening `itertext()` approach
- … `iter_cell_texts()`

Files changed
- `unstructured/common/html_table.py`: `HtmlTable.text` property now iterates rows via `iter_rows()` and joins them with `\n` instead of flattening all text with spaces
- `test_unstructured/common/test_html_table.py`: updated expected text to use newline-separated rows
- `test_unstructured/partition/test_constants.py`: updated `EXPECTED_TEXT*` constants to reflect the newline-separated row format
- `test_unstructured/partition/test_xlsx.py`: updated direct `.text ==` assertions for the subtable and `find_subtable=False` tests
- `test_unstructured/partition/test_csv.py`: removed unnecessary `clean_extra_whitespace` wrappers and an unused import
- `test_unstructured/partition/test_tsv.py`: updated header test assertion

Test plan
- … `test_partition_csv_with_encoding` unrelated to this change)
- … (`ruff check` + `ruff format`)
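To make the before/after behavior described in the summary concrete, here is a minimal standalone sketch. It assumes well-formed markup and uses the standard library's `xml.etree.ElementTree` rather than the project's own parser; `TABLE_HTML`, `flattened_text`, and `row_preserving_text` are hypothetical names for illustration and do not reproduce the actual `HtmlTable` implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed table markup, used only for illustration.
TABLE_HTML = (
    "<table>"
    "<tr><td>Name</td><td>Qty</td></tr>"
    "<tr><td>Apple</td><td>3</td></tr>"
    "<tr><td>Pear</td><td>5</td></tr>"
    "</table>"
)


def flattened_text(table: ET.Element) -> str:
    """Join every text node with spaces; row boundaries are lost."""
    return " ".join(table.itertext())


def row_preserving_text(table: ET.Element) -> str:
    """Join cells within a row with spaces and rows with newlines."""
    rows = []
    for tr in table.iter("tr"):
        # Collect the text of each direct td/th child of this row.
        cells = [
            "".join(cell.itertext()).strip()
            for cell in tr
            if cell.tag in ("td", "th")
        ]
        row_text = " ".join(c for c in cells if c)
        if row_text:
            rows.append(row_text)
    return "\n".join(rows)


table = ET.fromstring(TABLE_HTML)
flat = flattened_text(table)
by_row = row_preserving_text(table)
print(repr(flat))
print(repr(by_row))
```

This sketch ignores nested tables (`table.iter("tr")` would also visit rows of an inner table), which a real implementation would need to handle separately.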