@misrasaurabh1
Contributor

📄 68% (0.68x) speedup for clean_extra_whitespace_with_index_run in unstructured/cleaners/core.py

⏱️ Runtime: 3.74 milliseconds → 2.22 milliseconds (best of 19 runs)
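
As a quick sanity check, the headline percentage appears consistent with the ratio of the two runtimes minus one. The helper name `speedup_percent` below is an illustrative assumption and is not part of Codeflash or unstructured; the exact formula the bot uses is not stated in this report.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Return the relative speedup in percent; a negative value means slower."""
    return (original / optimized - 1.0) * 100.0


# Headline runtimes quoted in this report, in milliseconds.
headline = speedup_percent(3.74, 2.22)
print(f"headline speedup: about {headline:.0f}%")
```

The per-test figures in the annotated tests are computed from unrounded timings, so recomputing them from the rounded values shown may differ by a point or two.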

📝 Explanation and details

The optimized code achieves a 68% speedup through two key changes that eliminate expensive operations in the main loop:

What Changed

  1. Character replacement optimization: Replaced re.sub(r"[\xa0\n]", " ", text) with text.translate() using a translation table. This avoids regex compilation and pattern matching for simple character substitutions.

  2. Main loop optimization: Eliminated two re.match() calls per iteration by:

    • Pre-computing character comparisons (c_orig = text_chars[original_index])
    • Using set membership (c_orig in ws_chars) instead of regex matching
    • Direct character comparison (c_clean == ' ') instead of regex
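
The first change can be sketched as follows. This is a minimal illustration, assuming the translation table maps only `\xa0` and `\n` to a space; the names `WS_TABLE`, `normalize_with_regex`, and `normalize_with_translate` are hypothetical and may not match the merged code.

```python
import re

# Assumed translation table: non-breaking space and newline both become " ".
WS_TABLE = str.maketrans({"\xa0": " ", "\n": " "})


def normalize_with_regex(text: str) -> str:
    """Original approach as described: regex substitution per call."""
    return re.sub(r"[\xa0\n]", " ", text)


def normalize_with_translate(text: str) -> str:
    """Optimized approach as described: table-driven character mapping."""
    return text.translate(WS_TABLE)


sample = "Hello\n\xa0world"
# Both approaches should produce the same normalized string.
same = normalize_with_regex(sample) == normalize_with_translate(sample)
```

Building the table once at module level keeps the per-call cost to a single `str.translate()` pass.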

Why It's Faster

Looking at the line profiler data, the original code spent 15.4% of total time (10.8% + 4.6%) on regex matching inside the loop:

  • bool(re.match("[\xa0\n]", text[original_index])) - 7.12ms (10.8%)
  • bool(re.match(" ", cleaned_text[cleaned_index])) - 3.02ms (4.6%)

The optimized version replaces these with:

  • Set membership check: c_orig in ws_chars - 1.07ms (1.4%)
  • Direct comparison: c_clean == ' ' (included in same line)

Result: Regex overhead inside the loop is eliminated, saving roughly 9ms in total across the 142 invocations in the benchmark.
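
The loop check described above can be approximated with two predicates. This is a sketch only: the function names, `WS_CHARS`, and `compare_timings` are assumptions for illustration, and the timings it produces depend on the machine and Python version.

```python
from __future__ import annotations

import re
import timeit

# Assumed set of characters that are normalized to a plain space.
WS_CHARS = frozenset("\xa0\n")


def matches_with_regex(c_orig: str, c_clean: str) -> bool:
    """Regex form of the check, mirroring the lines quoted from the profiler."""
    return bool(re.match("[\xa0\n]", c_orig)) and bool(re.match(" ", c_clean))


def matches_with_set(c_orig: str, c_clean: str) -> bool:
    """Set-membership form; equivalent for single-character inputs."""
    return c_orig in WS_CHARS and c_clean == " "


def compare_timings(repeats: int = 100_000) -> tuple[float, float]:
    """Time both predicates on one character pair and return seconds for each."""
    regex_s = timeit.timeit(lambda: matches_with_regex("\n", " "), number=repeats)
    set_s = timeit.timeit(lambda: matches_with_set("\n", " "), number=repeats)
    return regex_s, set_s
```

Calling `compare_timings()` returns a `(regex_seconds, set_seconds)` pair that can be compared locally; the equivalence only holds when both arguments are single characters, which is the case inside the loop.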

Performance Profile

The annotated tests show the optimization excels when:

  • Large inputs with whitespace: test_large_leading_and_trailing_whitespace shows 291% speedup (203μs → 52.1μs)
  • Many consecutive whitespace characters: test_large_mixed_whitespace_everywhere shows 297% speedup (189μs → 47.8μs)
  • Mixed whitespace types (spaces, newlines, nbsp): test_edge_all_whitespace_between_words shows 47.9% speedup

Small inputs see minor regressions (roughly 1-17% slower in the annotated tests), likely due to setup overhead, but these are negligible in absolute terms (< 2μs difference).

Impact on Production Workloads

The function is called in _process_pdfminer_pages() during PDF text extraction, processing every text snippet on every page. PDFs often contain:

  • Multiple spaces/tabs between words
  • Newlines from paragraph breaks
  • Non-breaking spaces from formatting

Because these patterns are common, the per-snippet savings should add up across large documents with hundreds of pages.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 45 Passed |
| ⏪ Replay Tests | ✅ 16 Passed |
| 🔎 Concolic Coverage Tests | ✅ 1 Passed |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests
from __future__ import annotations

# imports
from unstructured.cleaners.core import clean_extra_whitespace_with_index_run

# unit tests

# --- BASIC TEST CASES ---


def test_basic_single_spaces():
    # No extra whitespace, should remain unchanged
    text = "Hello world"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 8.95μs -> 9.71μs (7.88% slower)


def test_basic_multiple_spaces():
    # Multiple spaces between words should be reduced to one
    text = "Hello     world"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 11.0μs -> 10.00μs (10.0% faster)


def test_basic_newlines_and_nbsp():
    # Newlines and non-breaking spaces replaced with single space
    text = "Hello\n\xa0world"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 12.8μs -> 10.2μs (25.2% faster)


def test_basic_leading_and_trailing_spaces():
    # Leading and trailing spaces should be stripped
    text = "   Hello world   "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 10.4μs -> 9.88μs (5.62% faster)


def test_basic_only_spaces():
    # Only spaces should return an empty string
    text = "     "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 6.10μs -> 6.45μs (5.43% slower)


def test_basic_only_newlines_and_nbsp():
    # Only newlines and non-breaking spaces should return empty string
    text = "\n\xa0\n\xa0"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 6.47μs -> 6.21μs (4.25% faster)


def test_basic_mixed_whitespace_between_words():
    # Mixed spaces, newlines, and nbsp between words
    text = "A\n\n\xa0   B"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 12.9μs -> 9.07μs (41.9% faster)


# --- EDGE TEST CASES ---


def test_edge_empty_string():
    # Empty string should return empty string and empty indices
    text = ""
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 5.53μs -> 5.62μs (1.73% slower)


def test_edge_all_whitespace():
    # String with only whitespace, newlines, and nbsp
    text = " \n\xa0  \n\xa0"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 6.91μs -> 7.15μs (3.40% slower)


def test_edge_one_character():
    # Single non-whitespace character
    text = "A"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 5.86μs -> 6.33μs (7.52% slower)


def test_edge_one_whitespace_character():
    # Single whitespace character
    text = " "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 5.26μs -> 5.96μs (11.8% slower)


def test_edge_whitespace_between_every_char():
    # Whitespace between every character
    text = "H E L L O"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 7.13μs -> 8.59μs (17.0% slower)


def test_edge_multiple_types_of_whitespace():
    # Combination of spaces, newlines, and nbsp between words
    text = "A \n\xa0  B"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 12.3μs -> 8.56μs (44.1% faster)


def test_edge_trailing_newlines_and_nbsp():
    # Trailing newlines and nbsp should be stripped
    text = "Hello world\n\xa0"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 8.36μs -> 9.20μs (9.07% slower)


def test_edge_leading_newlines_and_nbsp():
    # Leading newlines and nbsp should be stripped
    text = "\n\xa0Hello world"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 11.3μs -> 9.86μs (14.6% faster)


def test_edge_alternating_whitespace():
    # Alternating whitespace and characters
    text = " H E L L O "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 8.30μs -> 8.81μs (5.80% slower)


def test_edge_long_run_of_whitespace():
    # Long run of whitespace in the middle
    text = "Hello" + (" " * 50) + "world"
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 27.5μs -> 13.4μs (106% faster)


# --- LARGE SCALE TEST CASES ---


def test_large_no_extra_whitespace():
    # Large string with no extra whitespace
    text = "A" * 1000
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 106μs -> 93.6μs (13.3% faster)


def test_large_all_whitespace():
    # Large string of only whitespace
    text = " " * 1000
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 13.1μs -> 8.95μs (46.6% faster)


def test_large_alternating_char_and_whitespace():
    # Large string alternating between character and whitespace
    text = "".join(["A " for _ in range(500)])  # 500 'A ', total length 1000
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 106μs -> 95.5μs (11.5% faster)


def test_large_multiple_whitespace_blocks():
    # Large string with random blocks of whitespace
    text = "A" + (" " * 10) + "B" + ("\n" * 10) + "C" + ("\xa0" * 10) + "D"
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 28.6μs -> 12.9μs (122% faster)


def test_large_leading_and_trailing_whitespace():
    # Large leading and trailing whitespace
    text = (" " * 500) + "Hello world" + (" " * 500)
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 203μs -> 52.1μs (291% faster)


def test_large_mixed_whitespace_everywhere():
    # Large text with mixed whitespace everywhere
    text = (" " * 100) + "A" + ("\n" * 100) + "B" + ("\xa0" * 100) + "C" + (" " * 100)
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 189μs -> 47.8μs (297% faster)


# --- FUNCTIONALITY AND INTEGRITY TESTS ---


def test_mutation_detection_extra_space():
    # If function fails to remove extra spaces, test should fail
    text = "Test     case"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 9.65μs -> 8.87μs (8.84% faster)


def test_mutation_detection_strip():
    # If function fails to strip leading/trailing whitespace, test should fail
    text = "   Test case   "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 9.64μs -> 8.97μs (7.41% faster)


def test_mutation_detection_newline_nbsp():
    # If function fails to replace newlines or nbsp, test should fail
    text = "Test\n\xa0case"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 11.7μs -> 9.45μs (23.5% faster)


def test_mutation_detection_index_integrity():
    # Changing the index logic should break this test
    text = "A     B"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 8.82μs -> 7.73μs (14.2% faster)


def test_mutation_detection_empty_output():
    # If function fails to return empty string for all whitespace, test should fail
    text = "   \n\xa0   "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 7.79μs -> 8.53μs (8.65% slower)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from __future__ import annotations

# imports
from unstructured.cleaners.core import clean_extra_whitespace_with_index_run

# unit tests

# 1. Basic Test Cases


def test_basic_no_extra_whitespace():
    # Text with no extra whitespace should remain unchanged
    text = "Hello world!"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 10.3μs -> 10.9μs (5.46% slower)


def test_basic_multiple_spaces_between_words():
    # Multiple spaces between words should be reduced to one
    text = "Hello    world!"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 11.1μs -> 10.2μs (9.12% faster)


def test_basic_leading_and_trailing_spaces():
    # Leading and trailing spaces should be stripped
    text = "   Hello world!   "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 10.5μs -> 9.89μs (6.26% faster)


def test_basic_newline_and_nonbreaking_space():
    # Newlines and non-breaking spaces should be converted to single spaces
    text = "Hello\nworld!\xa0Test"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 12.4μs -> 9.69μs (28.0% faster)


def test_basic_combined_whitespace_types():
    # Combination of spaces, newlines, and non-breaking spaces
    text = "A  \n\xa0  B"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 11.4μs -> 9.02μs (26.3% faster)


# 2. Edge Test Cases


def test_edge_empty_string():
    # Empty string should return empty string and empty indices
    text = ""
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 5.58μs -> 5.64μs (1.01% slower)


def test_edge_only_spaces():
    # String with only spaces should return empty string
    text = "     "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 5.98μs -> 6.55μs (8.71% slower)


def test_edge_only_newlines_and_nbsp():
    # String with only newlines and non-breaking spaces
    text = "\n\xa0\n\xa0"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 6.54μs -> 6.06μs (7.91% faster)


def test_edge_single_character():
    # Single character should remain unchanged
    text = "A"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 6.01μs -> 6.45μs (6.78% slower)


def test_edge_all_whitespace_between_words():
    # All whitespace between words should be reduced to one space
    text = "A   \n\xa0   B"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 13.4μs -> 9.08μs (47.9% faster)


def test_edge_whitespace_at_various_positions():
    # Whitespace at start, middle, and end
    text = "   A  B   "
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 9.58μs -> 8.26μs (16.0% faster)


def test_edge_multiple_consecutive_whitespace_groups():
    # Several groups of consecutive whitespace
    text = "A  \n\n  B    C"
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 12.9μs -> 9.42μs (37.3% faster)


# 3. Large Scale Test Cases


def test_large_long_string_with_regular_spacing():
    # Large string with regular words and single spaces
    text = "word " * 200
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text.strip()
    )  # 107μs -> 95.9μs (12.2% faster)


def test_large_long_string_with_extra_spaces():
    # Large string with extra spaces between words
    text = ("word    " * 200).strip()
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 402μs -> 180μs (123% faster)


def test_large_mixed_whitespace():
    # Large string with mixed whitespace types
    words = ["word"] * 500
    text = " \n\xa0 ".join(words)
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 1.37ms -> 598μs (129% faster)


def test_large_leading_and_trailing_whitespace():
    # Large string with leading and trailing whitespace
    text = " " * 100 + "word " * 800 + " " * 100
    cleaned, indices = clean_extra_whitespace_with_index_run(text)  # 468μs -> 374μs (25.1% faster)


def test_large_string_all_whitespace():
    # Large string of only whitespace
    text = " " * 999
    cleaned, indices = clean_extra_whitespace_with_index_run(
        text
    )  # 13.8μs -> 8.85μs (55.9% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unstructured.cleaners.core import clean_extra_whitespace_with_index_run


def test_clean_extra_whitespace_with_index_run():
    clean_extra_whitespace_with_index_run("\n\x00")

⏪ Replay Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| test_benchmark1_py__replay_test_0.py::test_unstructured_cleaners_core_clean_extra_whitespace_with_index_run | 376μs | 347μs | 8.63% ✅ |

🔎 Concolic Coverage Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| codeflash_concolic_3yq4ufg_/tmp5dfyu5tu/test_concolic_coverage.py::test_clean_extra_whitespace_with_index_run | 27.1μs | 17.7μs | 52.7% ✅ |

To edit these changes, run git checkout codeflash/optimize-clean_extra_whitespace_with_index_run-mji60td0 and push.

codeflash-ai bot and others added 11 commits December 23, 2025 05:49
@qued
Contributor

qued commented Jan 7, 2026

@claude review this please.

@claude

claude bot commented Jan 7, 2026

Claude encountered an error

Failed with exit code 128

I'll analyze this and get back to you.

@qued qued merged commit 03be9c0 into Unstructured-IO:main Jan 8, 2026
39 of 40 checks passed