Skip to content

Conversation

@sgomezvillamor
Copy link
Contributor

Extends Rob's regex optimization pattern (#15463) to additional ingestion hot paths:

  1. SqlQueriesSource: Pre-compile temp_table_patterns using @cached_property

    • Called for every table during query processing
    • Eliminates repeated regex compilation overhead
  2. BigQuery: Pre-compile sharded table & wildcard patterns at module level

    • get_table_and_shard(): Called for every BigQuery table
    • get_table_display_name(): Called for table name normalization
    • is_sharded_table(): Called during table classification
  3. PowerBI ODBC: Pre-compile platform detection patterns at module level

    • normalize_platform_from_driver(): Called for every ODBC connection
    • normalize_platform_name(): Called during platform normalization
    • Affects 18+ database platform patterns

All changes follow the same optimization strategy as #15463:

  • Compile regex patterns once at initialization
  • Use compiled Pattern objects in hot path
  • Maintain exact behavioral equivalence
  • No config changes or breaking changes

Expected impact: Performance improvement for ingestion workloads with:

  • High volume of temp table checks (SqlQueriesSource)
  • Large BigQuery datasets with sharded tables
  • PowerBI sources with many ODBC connections

🤖 Generated with Claude Code

Extends Rob's regex optimization pattern (#15463) to additional ingestion hot paths:

1. **SqlQueriesSource**: Pre-compile temp_table_patterns using @cached_property
   - Called for every table during query processing
   - Eliminates repeated regex compilation overhead

2. **BigQuery**: Pre-compile sharded table & wildcard patterns at module level
   - get_table_and_shard(): Called for every BigQuery table
   - get_table_display_name(): Called for table name normalization
   - is_sharded_table(): Called during table classification

3. **PowerBI ODBC**: Pre-compile platform detection patterns at module level
   - normalize_platform_from_driver(): Called for every ODBC connection
   - normalize_platform_name(): Called during platform normalization
   - Affects 18+ database platform patterns

All changes follow the same optimization strategy as #15463:
- Compile regex patterns once at initialization
- Use compiled Pattern objects in hot path
- Maintain exact behavioral equivalence
- No config changes or breaking changes

Expected impact: Performance improvement for ingestion workloads with:
- High volume of temp table checks (SqlQueriesSource)
- Large BigQuery datasets with sharded tables
- PowerBI sources with many ODBC connections

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Dec 3, 2025
@codecov
Copy link

codecov bot commented Dec 3, 2025

Codecov Report

❌ Patch coverage is 78.94737% with 4 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...c/datahub/ingestion/source/powerbi/m_query/odbc.py 50.00% 3 Missing ⚠️
...estion/src/datahub/ingestion/source/sql_queries.py 87.50% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Dec 3, 2025
@ligfx
Copy link
Contributor

ligfx commented Dec 3, 2025

I'm curious what the performance improvement seen from this is? Python caches compiled regular expressions in a decently large LRU cache (~100 items iirc?) anyways, so any improvement is likely going to be from skipping a cache lookup each time.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 3, 2025
@sgomezvillamor
Copy link
Contributor Author

I'm curious what the performance improvement seen from this is? Python caches compiled regular expressions in a decently large LRU cache (~100 items iirc?) anyways, so any improvement is likely going to be from skipping a cache lookup each time.

@rob-1019 did some benchmark

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 4, 2025
@rob-1019
Copy link

rob-1019 commented Dec 4, 2025

Have not looked into the specifics of what you get automatically without compiling but explicitly compiling is visible in both the real world and the micro-benchmarks even with a small number of regexes even with a single regex.

Benchmarking 100000 random lookups over 100 patterns...

Aho–Corasick (exact match)      time=  0.8169s  hits=100000
Regex uncompiled / re.match     time=  2.4722s  hits=100000
Regex precompiled objects       time=  0.6782s  hits=100000
Big OR regex (uncompiled)       time=  0.0901s  hits=100000
Big OR regex (compiled)         time=  0.0543s  hits=100000
Plain set membership            time=  0.0099s  hits=100000

Even at 10 it shows.

tring>:95: SyntaxWarning: invalid escape sequence '\Z'
Building data structures...
Pattern lengths build        :   0.0000s
Regex literal strings prep   :   0.0001s
Compile per-pattern regexes  :   0.0007s
Build big OR regex pattern   :   0.0001s
Compile big OR regex         :   0.0006s
Build set()                  :   0.0000s
Build Aho–Corasick automaton :   0.0012s

Benchmarking 100000 random lookups over 10 patterns...

Aho–Corasick (exact match)      time=  0.8002s  hits=100000
Regex uncompiled / re.match     time=  0.3085s  hits=100000
Regex precompiled objects       time=  0.0995s  hits=100000
Big OR regex (uncompiled)       time=  0.0743s  hits=100000
Big OR regex (compiled)

@ligfx
Copy link
Contributor

ligfx commented Dec 4, 2025

Neat, so using the pre-compiled regex is about 3–4x faster by avoiding the cache lookup.

In absolute terms, looks like that's about 2 millionths of a second faster per regex match. So for filtering a source with 2 million tables, ingestion will be 1 second faster.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata pending-submitter-response Issue/request has been reviewed but requires a response from the submitter

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants