⚡️ Speed up function `clean_extra_whitespace_with_index_run` by 68% #4166

misrasaurabh1 · 2026-01-06T18:42:24Z

📄 68% (0.68x) speedup for `clean_extra_whitespace_with_index_run` in `unstructured/cleaners/core.py`

⏱️ Runtime : 3.74 milliseconds → 2.22 milliseconds (best of 19 runs)
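The headline figure can be reproduced from the two runtimes above. The sketch below assumes the speedup is defined as the original runtime divided by the optimized runtime, minus one; the helper names are illustrative and are not part of the codeflash tooling.

```python
def speedup_percent(original_ms: float, optimized_ms: float) -> float:
    """Speedup relative to the optimized runtime, as a percentage (assumed definition)."""
    return (original_ms / optimized_ms - 1.0) * 100.0


def runtime_reduction_percent(original_ms: float, optimized_ms: float) -> float:
    """Share of the original runtime that was removed, as a percentage."""
    return (original_ms - optimized_ms) / original_ms * 100.0


print(round(speedup_percent(3.74, 2.22)))            # 68
print(round(runtime_reduction_percent(3.74, 2.22)))  # 41
```

Under that definition a 68% speedup corresponds to roughly a 41% drop in wall-clock time for this benchmark.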

📝 Explanation and details

The optimized code achieves a 68% speedup through two changes that replace regex calls with cheaper built-in string operations:

What Changed

1. Character replacement optimization: Replaced `re.sub(r"[\xa0\n]", " ", text)` with `text.translate()` using a translation table. This bypasses the regex engine for simple one-to-one character substitutions.
2. Main loop optimization: Eliminated two `re.match()` calls per iteration by:
   - Reading each character once into a local variable (`c_orig = text_chars[original_index]`)
   - Using set membership (`c_orig in ws_chars`) instead of regex matching
   - Using direct character comparison (`c_clean == ' '`) instead of regex
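The character replacement change can be illustrated in isolation. This is a minimal sketch, not the code from the PR: `WS_TABLE` and both function names are assumptions chosen for illustration.

```python
import re

# Translation table mapping non-breaking space and newline to a plain space.
# The name WS_TABLE is hypothetical; the PR only says "a translation table".
WS_TABLE = str.maketrans({"\xa0": " ", "\n": " "})


def replace_with_regex(text: str) -> str:
    """The approach the PR description attributes to the original code."""
    return re.sub(r"[\xa0\n]", " ", text)


def replace_with_translate(text: str) -> str:
    """Same substitution via str.translate, with no regex involved."""
    return text.translate(WS_TABLE)


sample = "Hello\n\xa0world"
# Both produce "Hello  world" (two spaces); collapsing runs is a later step.
assert replace_with_regex(sample) == replace_with_translate(sample)
```

Building the table once at module level, rather than on every call, keeps the per-call cost down to the `translate` call itself.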

Why It's Faster

According to the line profiler data, the original code spent 15.4% of total time (10.8% + 4.6%) on regex matching inside the loop:

- `bool(re.match("[\xa0\n]", text[original_index]))`: 7.12ms (10.8%)
- `bool(re.match(" ", cleaned_text[cleaned_index]))`: 3.02ms (4.6%)

The optimized version replaces these with:

- Set membership check: `c_orig in ws_chars`, 1.07ms (1.4%)
- Direct comparison: `c_clean == ' '` (included in the same line)

Result: The per-iteration regex calls are eliminated, saving ~9ms across the 142 invocations in the benchmark.
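The loop condition described above can be sketched as two interchangeable predicates driving the same index walk. This is a simplified reconstruction from the PR description, not the code in `unstructured/cleaners/core.py`: the names `WS_TABLE`, `WS_CHARS`, `match_regex`, `match_fast`, and `index_run` are assumptions, and a plain list stands in for whatever array type the real function returns. Passing the predicate as a callback makes the comparison easy to read but adds call overhead, so this is a correctness sketch rather than a benchmark.

```python
import re
from typing import Callable, List, Tuple

WS_TABLE = str.maketrans({"\xa0": " ", "\n": " "})  # hypothetical name
WS_CHARS = frozenset("\xa0\n")                      # hypothetical name


def match_regex(c_orig: str, c_clean: str) -> bool:
    """Loop condition using the two re.match() calls quoted in the profile."""
    return c_orig == c_clean or (
        bool(re.match("[\xa0\n]", c_orig)) and bool(re.match(" ", c_clean))
    )


def match_fast(c_orig: str, c_clean: str) -> bool:
    """Same condition using set membership and a direct comparison."""
    return c_orig == c_clean or (c_orig in WS_CHARS and c_clean == " ")


def index_run(text: str, chars_match: Callable[[str, str], bool]) -> Tuple[str, List[int]]:
    # Normalize nbsp/newline to spaces, then collapse runs of spaces.
    cleaned = re.sub(r"[ ]{2,}", " ", text.translate(WS_TABLE))
    moved = [0] * len(text)
    distance = original_index = cleaned_index = 0
    while cleaned_index < len(cleaned):
        if chars_match(text[original_index], cleaned[cleaned_index]):
            moved[cleaned_index] = distance
            original_index += 1
            cleaned_index += 1
            continue
        # This original character was dropped when runs were collapsed.
        distance += 1
        moved[cleaned_index] = distance
        original_index += 1
    # Positions past the end of the cleaned text share the final offset.
    moved[cleaned_index:] = [distance] * (len(text) - cleaned_index)
    return cleaned.strip(), moved


samples = ["", "A  B", "Hello\n\xa0world", "   Test case   ", " \n\xa0  \n\xa0"]
for s in samples:
    assert index_run(s, match_regex) == index_run(s, match_fast)

print(index_run("A  B", match_fast))  # ('A B', [0, 0, 1, 1])
```

Because both predicates only ever see single characters, `re.match` against a one-character class is equivalent to the membership test, which is why the two walks agree on every sample.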

Performance Profile

The annotated tests show the optimization excels when:

- Large inputs with whitespace: `test_large_leading_and_trailing_whitespace` shows 291% speedup (203μs → 52.1μs)
- Many consecutive whitespace characters: `test_large_mixed_whitespace_everywhere` shows 297% speedup (189μs → 47.8μs)
- Mixed whitespace types (spaces, newlines, nbsp): `test_edge_all_whitespace_between_words` shows 47.9% speedup (13.4μs → 9.08μs)

Small inputs with minimal whitespace see minor regressions (~1-17% slower in the annotated tests) due to setup overhead, but these are negligible in absolute terms (< 2μs difference).

Impact on Production Workloads

The function is called in `_process_pdfminer_pages()` during PDF text extraction, processing every text snippet on every page. Given that PDFs often contain:

- Multiple spaces/tabs between words
- Newlines from paragraph breaks
- Non-breaking spaces from formatting

This optimization will provide substantial cumulative benefits when processing large documents with hundreds of pages, as the per-snippet savings compound across the entire document.

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 45 Passed |
| ⏪ Replay Tests | ✅ 16 Passed |
| 🔎 Concolic Coverage Tests | ✅ 1 Passed |
| 📊 Tests Coverage | 100.0% |

🌀 Click to see Generated Regression Tests

from __future__ import annotations

# imports
from unstructured.cleaners.core import clean_extra_whitespace_with_index_run

# unit tests

# --- BASIC TEST CASES ---


def test_basic_single_spaces():
    # No extra whitespace, should remain unchanged
    text = "Hello world"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 8.95μs -> 9.71μs (7.88% slower)


def test_basic_multiple_spaces():
    # Multiple spaces between words should be reduced to one
    text = "Hello     world"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 11.0μs -> 10.00μs (10.0% faster)


def test_basic_newlines_and_nbsp():
    # Newlines and non-breaking spaces replaced with single space
    text = "Hello\n\xa0world"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 12.8μs -> 10.2μs (25.2% faster)


def test_basic_leading_and_trailing_spaces():
    # Leading and trailing spaces should be stripped
    text = "   Hello world   "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 10.4μs -> 9.88μs (5.62% faster)


def test_basic_only_spaces():
    # Only spaces should return an empty string
    text = "     "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 6.10μs -> 6.45μs (5.43% slower)


def test_basic_only_newlines_and_nbsp():
    # Only newlines and non-breaking spaces should return empty string
    text = "\n\xa0\n\xa0"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 6.47μs -> 6.21μs (4.25% faster)


def test_basic_mixed_whitespace_between_words():
    # Mixed spaces, newlines, and nbsp between words
    text = "A\n\n\xa0   B"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 12.9μs -> 9.07μs (41.9% faster)


# --- EDGE TEST CASES ---


def test_edge_empty_string():
    # Empty string should return empty string and empty indices
    text = ""
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 5.53μs -> 5.62μs (1.73% slower)


def test_edge_all_whitespace():
    # String with only whitespace, newlines, and nbsp
    text = " \n\xa0  \n\xa0"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 6.91μs -> 7.15μs (3.40% slower)


def test_edge_one_character():
    # Single non-whitespace character
    text = "A"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 5.86μs -> 6.33μs (7.52% slower)


def test_edge_one_whitespace_character():
    # Single whitespace character
    text = " "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 5.26μs -> 5.96μs (11.8% slower)


def test_edge_whitespace_between_every_char():
    # Whitespace between every character
    text = "H E L L O"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 7.13μs -> 8.59μs (17.0% slower)


def test_edge_multiple_types_of_whitespace():
    # Combination of spaces, newlines, and nbsp between words
    text = "A \n\xa0  B"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 12.3μs -> 8.56μs (44.1% faster)


def test_edge_trailing_newlines_and_nbsp():
    # Trailing newlines and nbsp should be stripped
    text = "Hello world\n\xa0"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 8.36μs -> 9.20μs (9.07% slower)


def test_edge_leading_newlines_and_nbsp():
    # Leading newlines and nbsp should be stripped
    text = "\n\xa0Hello world"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 11.3μs -> 9.86μs (14.6% faster)


def test_edge_alternating_whitespace():
    # Alternating whitespace and characters
    text = " H E L L O "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 8.30μs -> 8.81μs (5.80% slower)


def test_edge_long_run_of_whitespace():
    # Long run of whitespace in the middle
    text = "Hello" + (" " * 50) + "world"
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 27.5μs -> 13.4μs (106% faster)


# --- LARGE SCALE TEST CASES ---


def test_large_no_extra_whitespace():
    # Large string with no extra whitespace
    text = "A" * 1000
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 106μs -> 93.6μs (13.3% faster)


def test_large_all_whitespace():
    # Large string of only whitespace
    text = " " * 1000
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 13.1μs -> 8.95μs (46.6% faster)


def test_large_alternating_char_and_whitespace():
    # Large string alternating between character and whitespace
    text = "".join(["A " for _ in range(500)])  # 500 'A ', total length 1000
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 106μs -> 95.5μs (11.5% faster)


def test_large_multiple_whitespace_blocks():
    # Large string with random blocks of whitespace
    text = "A" + (" " * 10) + "B" + ("\n" * 10) + "C" + ("\xa0" * 10) + "D"
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 28.6μs -> 12.9μs (122% faster)


def test_large_leading_and_trailing_whitespace():
    # Large leading and trailing whitespace
    text = (" " * 500) + "Hello world" + (" " * 500)
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 203μs -> 52.1μs (291% faster)


def test_large_mixed_whitespace_everywhere():
    # Large text with mixed whitespace everywhere
    text = (" " * 100) + "A" + ("\n" * 100) + "B" + ("\xa0" * 100) + "C" + (" " * 100)
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 189μs -> 47.8μs (297% faster)


# --- FUNCTIONALITY AND INTEGRITY TESTS ---


def test_mutation_detection_extra_space():
    # If function fails to remove extra spaces, test should fail
    text = "Test     case"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 9.65μs -> 8.87μs (8.84% faster)


def test_mutation_detection_strip():
    # If function fails to strip leading/trailing whitespace, test should fail
    text = "   Test case   "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 9.64μs -> 8.97μs (7.41% faster)


def test_mutation_detection_newline_nbsp():
    # If function fails to replace newlines or nbsp, test should fail
    text = "Test\n\xa0case"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 11.7μs -> 9.45μs (23.5% faster)


def test_mutation_detection_index_integrity():
    # Changing the index logic should break this test
    text = "A     B"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 8.82μs -> 7.73μs (14.2% faster)


def test_mutation_detection_empty_output():
    # If function fails to return empty string for all whitespace, test should fail
    text = "   \n\xa0   "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 7.79μs -> 8.53μs (8.65% slower)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations

# imports
from unstructured.cleaners.core import clean_extra_whitespace_with_index_run

# unit tests

# 1. Basic Test Cases


def test_basic_no_extra_whitespace():
    # Text with no extra whitespace should remain unchanged
    text = "Hello world!"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 10.3μs -> 10.9μs (5.46% slower)


def test_basic_multiple_spaces_between_words():
    # Multiple spaces between words should be reduced to one
    text = "Hello    world!"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 11.1μs -> 10.2μs (9.12% faster)


def test_basic_leading_and_trailing_spaces():
    # Leading and trailing spaces should be stripped
    text = "   Hello world!   "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 10.5μs -> 9.89μs (6.26% faster)


def test_basic_newline_and_nonbreaking_space():
    # Newlines and non-breaking spaces should be converted to single spaces
    text = "Hello\nworld!\xa0Test"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 12.4μs -> 9.69μs (28.0% faster)


def test_basic_combined_whitespace_types():
    # Combination of spaces, newlines, and non-breaking spaces
    text = "A  \n\xa0  B"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 11.4μs -> 9.02μs (26.3% faster)


# 2. Edge Test Cases


def test_edge_empty_string():
    # Empty string should return empty string and empty indices
    text = ""
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 5.58μs -> 5.64μs (1.01% slower)


def test_edge_only_spaces():
    # String with only spaces should return empty string
    text = "     "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 5.98μs -> 6.55μs (8.71% slower)


def test_edge_only_newlines_and_nbsp():
    # String with only newlines and non-breaking spaces
    text = "\n\xa0\n\xa0"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 6.54μs -> 6.06μs (7.91% faster)


def test_edge_single_character():
    # Single character should remain unchanged
    text = "A"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 6.01μs -> 6.45μs (6.78% slower)


def test_edge_all_whitespace_between_words():
    # All whitespace between words should be reduced to one space
    text = "A   \n\xa0   B"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 13.4μs -> 9.08μs (47.9% faster)


def test_edge_whitespace_at_various_positions():
    # Whitespace at start, middle, and end
    text = "   A  B   "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 9.58μs -> 8.26μs (16.0% faster)


def test_edge_multiple_consecutive_whitespace_groups():
    # Several groups of consecutive whitespace
    text = "A  \n\n  B    C"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 12.9μs -> 9.42μs (37.3% faster)


# 3. Large Scale Test Cases


def test_large_long_string_with_regular_spacing():
    # Large string with regular words and single spaces
    text = "word " * 200
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text.strip()
    )  # 107μs -> 95.9μs (12.2% faster)


def test_large_long_string_with_extra_spaces():
    # Large string with extra spaces between words
    text = ("word    " * 200).strip()
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 402μs -> 180μs (123% faster)


def test_large_mixed_whitespace():
    # Large string with mixed whitespace types
    words = ["word"] * 500
    text = " \n\xa0 ".join(words)
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 1.37ms -> 598μs (129% faster)


def test_large_leading_and_trailing_whitespace():
    # Large string with leading and trailing whitespace
    text = " " * 100 + "word " * 800 + " " * 100
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 468μs -> 374μs (25.1% faster)


def test_large_string_all_whitespace():
    # Large string of only whitespace
    text = " " * 999
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 13.8μs -> 8.85μs (55.9% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from unstructured.cleaners.core import clean_extra_whitespace_with_index_run


def test_clean_extra_whitespace_with_index_run():
    clean_extra_whitespace_with_index_run("\n\x00")

⏪ Click to see Replay Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `test_benchmark1_py__replay_test_0.py::test_unstructured_cleaners_core_clean_extra_whitespace_with_index_run` | 376μs | 347μs | 8.63%✅ |

🔎 Click to see Concolic Coverage Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `codeflash_concolic_3yq4ufg_/tmp5dfyu5tu/test_concolic_coverage.py::test_clean_extra_whitespace_with_index_run` | 27.1μs | 17.7μs | 52.7%✅ |

To edit these changes, run `git checkout codeflash/optimize-clean_extra_whitespace_with_index_run-mji60td0` and push.


qued · 2026-01-07T20:15:28Z

@claude review this please.

claude · 2026-01-07T20:15:47Z

Claude encountered an error.

Failed with exit code 128

I'll analyze this and get back to you.


codeflash-ai bot and others added 11 commits December 23, 2025 05:49

clean logic

b17abe1

Merge branch 'main' into codeflash/optimize-clean_extra_whitespace_with_index_run-mji60td0

b724dd0

changelog and version

66f0449

changelog and version

2519504

changelog fix

58a4d57

undo changelog edit

b56f1e3

undo newline

6dbb249

correct number of newlines

046d7b1

Merge branch 'main' into codeflash/optimize-clean_extra_whitespace_with_index_run-mji60td0

64860ce

version sync

9985bec

Merge branch 'main' into codeflash/optimize-clean_extra_whitespace_with_index_run-mji60td0

020f515

qued approved these changes Jan 7, 2026


qued merged commit 03be9c0 into Unstructured-IO:main Jan 8, 2026
39 of 40 checks passed
