⚡️ Speed up function clean_extra_whitespace_with_index_run by 68%
#4166
Merged: qued merged 12 commits into Unstructured-IO:main from misrasaurabh1:codeflash/optimize-clean_extra_whitespace_with_index_run-mji60td0 on Jan 8, 2026
+23 −8
Conversation
The optimized code achieves a **68% speedup** through two key changes that eliminate expensive operations in the main loop:
## What Changed
1. **Character replacement optimization**: Replaced `re.sub(r"[\xa0\n]", " ", text)` with `text.translate()` using a translation table. This avoids regex compilation and pattern matching for simple character substitutions (see the sketch after this list).
2. **Main loop optimization**: Eliminated two `re.match()` calls per iteration by:
- Pre-computing character comparisons (`c_orig = text_chars[original_index]`)
- Using set membership (`c_orig in ws_chars`) instead of regex matching
- Direct character comparison (`c_clean == ' '`) instead of regex
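A minimal sketch of the two substitutions in isolation. The names `_WS_TABLE`, `_WS_CHARS`, and the helper functions are illustrative, not the identifiers used in `core.py`; this shows the technique rather than the actual implementation:

```python
import re

# Change 1: a str.translate() table replaces re.sub(r"[\xa0\n]", " ", text).
_WS_TABLE = str.maketrans({"\xa0": " ", "\n": " "})

def normalize_whitespace(text: str) -> str:
    # Equivalent to re.sub(r"[\xa0\n]", " ", text), without regex overhead.
    return text.translate(_WS_TABLE)

# Change 2: per-character regex checks inside the loop become plain comparisons.
_WS_CHARS = {"\xa0", "\n"}

def is_nbsp_or_newline(ch: str) -> bool:
    # Equivalent to bool(re.match("[\xa0\n]", ch)) for a single character.
    return ch in _WS_CHARS

# Quick equivalence check
sample = "a\xa0b\n c"
assert normalize_whitespace(sample) == re.sub(r"[\xa0\n]", " ", sample)
assert is_nbsp_or_newline("\xa0") and not is_nbsp_or_newline("x")
```

The second regex call, `re.match(" ", cleaned_text[cleaned_index])`, is replaced the same way by the direct comparison `c_clean == ' '`.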
## Why It's Faster
Looking at the line profiler data, the original code spent **15.4% of total time** (10.8% + 4.6%) on regex matching inside the loop:
- `bool(re.match("[\xa0\n]", text[original_index]))` - 7.12ms (10.8%)
- `bool(re.match(" ", cleaned_text[cleaned_index]))` - 3.02ms (4.6%)
The optimized version replaces these with:
- Set membership check: `c_orig in ws_chars` - 1.07ms (1.4%)
- Direct comparison: `c_clean == ' '` (included in same line)
**Result**: Regex overhead is eliminated, saving ~9ms across the 142 invocations in the benchmark.
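For context, a rough way to reproduce the gap between the two per-character checks outside the profiler (an illustrative micro-benchmark only; absolute numbers depend on the machine and Python build):

```python
import re
import timeit

ws_chars = {"\xa0", "\n"}
sample = "a\xa0b\nc  d " * 1000  # mixed text with nbsp, newlines, and spaces

# Per-character regex check, as in the original loop
regex_time = timeit.timeit(
    lambda: [bool(re.match("[\xa0\n]", ch)) for ch in sample], number=100
)
# Set membership check, as in the optimized loop
set_time = timeit.timeit(
    lambda: [ch in ws_chars for ch in sample], number=100
)
print(f"re.match per char: {regex_time:.3f}s, set membership: {set_time:.3f}s")
```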
## Performance Profile
The annotated tests show the optimization excels when:
- **Large inputs with whitespace**: `test_large_leading_and_trailing_whitespace` shows 291% speedup (203μs → 52.1μs)
- **Many consecutive whitespace characters**: `test_large_mixed_whitespace_everywhere` shows 297% speedup (189μs → 47.8μs)
- **Mixed whitespace types** (spaces, newlines, nbsp): `test_edge_all_whitespace_between_words` shows 47.9% speedup
Small inputs with minimal whitespace see minor regressions (~5-17% slower) due to setup overhead, but these are negligible in absolute terms (< 2μs difference).
## Impact on Production Workloads
The function is called in `_process_pdfminer_pages()` during PDF text extraction, processing **every text snippet on every page**. Given that PDFs often contain:
- Multiple spaces/tabs between words
- Newlines from paragraph breaks
- Non-breaking spaces from formatting
This optimization will provide substantial cumulative benefits when processing large documents with hundreds of pages, as the per-snippet savings compound across the entire document.
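As an illustration of the whitespace patterns above, here is a small PDF-style snippet cleaned with the same `translate()`-plus-collapse approach. This mirrors the cleaning behaviour described in this PR, not the library function itself:

```python
import re

# Text with runs of spaces, newlines from line breaks, and non-breaking spaces
snippet = "Section 1\n\nIntroduction\xa0\xa0to   the    topic"

# Normalize nbsp/newlines to spaces, then collapse runs of spaces
normalized = snippet.translate(str.maketrans({"\xa0": " ", "\n": " "}))
cleaned = re.sub(r" {2,}", " ", normalized).strip()
print(cleaned)  # Section 1 Introduction to the topic
```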
Contributor
@claude review this please.

Claude responded "I'll analyze this and get back to you." but then encountered an error.
qued approved these changes on Jan 7, 2026
📄 68% (0.68x) speedup for `clean_extra_whitespace_with_index_run` in `unstructured/cleaners/core.py`

⏱️ Runtime: 3.74 milliseconds → 2.22 milliseconds (best of 19 runs)
✅ Correctness verification report:

🌀 Generated Regression Tests

⏪ Replay Tests
`test_benchmark1_py__replay_test_0.py::test_unstructured_cleaners_core_clean_extra_whitespace_with_index_run`

🔎 Concolic Coverage Tests
`codeflash_concolic_3yq4ufg_/tmp5dfyu5tu/test_concolic_coverage.py::test_clean_extra_whitespace_with_index_run`

To edit these changes, run
`git checkout codeflash/optimize-clean_extra_whitespace_with_index_run-mji60td0`
and push.