perf(ingestion): pre-compile regex patterns in hot paths #15470
Extends Rob's regex optimization pattern (#15463) to additional ingestion hot paths:

1. **SqlQueriesSource**: Pre-compile temp_table_patterns using @cached_property
   - Called for every table during query processing
   - Eliminates repeated regex compilation overhead
2. **BigQuery**: Pre-compile sharded table & wildcard patterns at module level
   - get_table_and_shard(): Called for every BigQuery table
   - get_table_display_name(): Called for table name normalization
   - is_sharded_table(): Called during table classification
3. **PowerBI ODBC**: Pre-compile platform detection patterns at module level
   - normalize_platform_from_driver(): Called for every ODBC connection
   - normalize_platform_name(): Called during platform normalization
   - Affects 18+ database platform patterns

All changes follow the same optimization strategy as #15463:
- Compile regex patterns once at initialization
- Use compiled Pattern objects in hot path
- Maintain exact behavioral equivalence
- No config changes or breaking changes

Expected impact: Performance improvement for ingestion workloads with:
- High volume of temp table checks (SqlQueriesSource)
- Large BigQuery datasets with sharded tables
- PowerBI sources with many ODBC connections

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
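For readers who have not seen the pattern from #15463, here is a minimal sketch of the two approaches named above: a module-level compiled pattern and a `@cached_property` that compiles a configured list once per instance. This is not the code from this PR. The regex, `split_table_and_shard`, and `TempTableMatcher` are hypothetical names chosen for illustration, and the shard suffix format (an 8-digit date) is an assumption.

```python
import re
from functools import cached_property
from typing import List, Optional, Tuple

# Hypothetical shard-suffix pattern, compiled once when the module is imported.
_SHARD_SUFFIX = re.compile(r"^(.+?)_(\d{8})$")


def split_table_and_shard(name: str) -> Tuple[str, Optional[str]]:
    """Return (base_table, shard) using the pre-compiled module-level pattern."""
    match = _SHARD_SUFFIX.match(name)
    if match:
        return match.group(1), match.group(2)
    return name, None


class TempTableMatcher:
    """Hypothetical holder for user-configured temp table patterns."""

    def __init__(self, patterns: List[str]) -> None:
        self.patterns = patterns

    @cached_property
    def compiled(self) -> List[re.Pattern]:
        # Runs on first access only; the resulting list is stored on the instance.
        return [re.compile(p, re.IGNORECASE) for p in self.patterns]

    def is_temp_table(self, name: str) -> bool:
        # Hot path: reuses the compiled Pattern objects on every call.
        return any(p.match(name) for p in self.compiled)
```

The module-level form suits patterns that are fixed in code, while `@cached_property` suits patterns that come from per-source configuration and are only known after the object is constructed.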
I'm curious what the performance improvement seen from this is? Python caches compiled regular expressions in a decently large LRU cache (~100 items iirc?) anyways, so any improvement is likely going to be from skipping a cache lookup each time.
@rob-1019 did some benchmarks.
I have not looked into the specifics of what you get automatically without compiling, but the effect of explicit compilation is visible in both real-world runs and the micro-benchmarks, even with a small number of regexes. It shows with a single regex, and it shows at 10.
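A comparison along these lines can be reproduced with a small `timeit` script. The sketch below is not the benchmark referenced in this thread; the pattern and input string are hypothetical, and the absolute numbers will vary by machine and Python version. It contrasts `re.match` with a string pattern, which goes through the `re` module's internal cache of compiled patterns on each call, against calling `.match` on a pre-compiled `Pattern` object.

```python
import re
import timeit
from typing import Dict

PATTERN = r"^(.+?)_(\d{8})$"       # hypothetical pattern for illustration
COMPILED = re.compile(PATTERN)     # compiled once, up front
NAME = "events_20240101"           # hypothetical input


def via_module_function() -> bool:
    # Looks up the compiled pattern in re's internal cache on every call.
    return re.match(PATTERN, NAME) is not None


def via_precompiled() -> bool:
    # Skips the cache lookup and calls the Pattern object directly.
    return COMPILED.match(NAME) is not None


def bench(number: int = 200_000) -> Dict[str, float]:
    """Return the average seconds per call for each approach."""
    return {
        "re.match": timeit.timeit(via_module_function, number=number) / number,
        "compiled.match": timeit.timeit(via_precompiled, number=number) / number,
    }


if __name__ == "__main__":
    for label, seconds in bench().items():
        print(f"{label}: {seconds * 1e6:.3f} microseconds per call")
```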
Neat, so using the pre-compiled regex is about 3–4x faster by avoiding the cache lookup. In absolute terms, it looks like that's about 2 millionths of a second faster per regex match. So for filtering a source with 2 million tables, ingestion will be about 4 seconds faster.
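As a back-of-the-envelope check, the total saving is just the per-match saving multiplied by the number of matches. The sketch below assumes one regex match per table and takes the roughly 2 microseconds per match quoted above as its input; both are assumptions, not measurements from this PR.

```python
def estimated_saving_seconds(saving_per_match_s: float, matches: int) -> float:
    """Total time saved, assuming a constant saving per regex match."""
    return saving_per_match_s * matches


per_match = 2e-6      # assumed saving per match, in seconds
tables = 2_000_000    # assumed one match per table
total = estimated_saving_seconds(per_match, tables)
print(f"Estimated saving: {total:.1f} s")
```

Sources that run several patterns per table would scale this estimate by the number of patterns checked.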