feat(query): push rank limit into aggregate partial node #16466

Merged
merged 7 commits into from
Sep 19, 2024


sundy-li
Member

@sundy-li sundy-li commented Sep 19, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR pushes the rank limit into the partial aggregate node.

For example:

SELECT UserID, SearchPhrase, COUNT(*) FROM hits GROUP BY UserID, SearchPhrase ORDER BY UserID, SearchPhrase LIMIT 10;

Because the group keys may have high cardinality, the hash table (HT) has to resize multiple times to hold the incoming keys, which is inefficient.
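To make the resizing cost concrete, here is a minimal sketch that counts how often a hash table grows while absorbing distinct keys. It uses Rust's standard `HashMap` as a stand-in; Databend's own aggregation hash table is a different implementation, and the function name `count_capacity_growths` and the key shape are assumptions made only for illustration.

```rust
use std::collections::HashMap;

/// Inserts `n` distinct keys into an initially empty map and counts how many
/// times the map's reported capacity grows along the way.
/// Note: std's HashMap is only a stand-in for the engine's hash table.
fn count_capacity_growths(n: u64) -> usize {
    let mut map: HashMap<(u64, u64), u64> = HashMap::new();
    let mut growths = 0;
    let mut last_capacity = map.capacity();
    for i in 0..n {
        // Every key is new, mimicking a high-cardinality GROUP BY.
        *map.entry((i, i.wrapping_mul(31))).or_insert(0) += 1;
        let capacity = map.capacity();
        if capacity > last_capacity {
            growths += 1;
            last_capacity = capacity;
        }
    }
    growths
}
```

Calling `count_capacity_growths` with a larger `n` should report more growth steps, and each step implies rehashing the keys already stored.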

We now filter the group keys with a partial sort that uses LimitType::Rank(n), keeping only the top n unique keys.

To do this, we inject a SortPartialTransform before the partial aggregate.
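The idea can be sketched outside the engine. The code below is not Databend's implementation: `Key`, `rank_limit`, and `count_groups` are hypothetical names, and a plain `HashMap` stands in for the partial aggregate. It assumes that a rank limit of n keeps every row whose key is among the n smallest distinct keys of a block. Any key that belongs to the global top n is also in the top n of every block that contains it, so its count stays exact.

```rust
use std::collections::HashMap;

// Hypothetical group key: (UserID, SearchPhrase).
type Key = (u64, String);

/// Sorts one block and keeps only the rows whose key is among the `n`
/// smallest distinct keys of that block (duplicates of kept keys survive).
fn rank_limit(mut block: Vec<Key>, n: usize) -> Vec<Key> {
    block.sort();
    let mut distinct = 0usize;
    let mut kept = Vec::new();
    for row in block {
        // A row that differs from the previous kept row starts a new rank.
        if kept.last() != Some(&row) {
            distinct += 1;
            if distinct > n {
                break;
            }
        }
        kept.push(row);
    }
    kept
}

/// Applies the rank filter to each block before counting, then performs the
/// final ORDER BY ... LIMIT n over the much smaller set of groups.
fn count_groups(blocks: Vec<Vec<Key>>, n: usize) -> Vec<(Key, u64)> {
    let mut table: HashMap<Key, u64> = HashMap::new();
    for block in blocks {
        for key in rank_limit(block, n) {
            *table.entry(key).or_insert(0) += 1;
        }
    }
    let mut result: Vec<(Key, u64)> = table.into_iter().collect();
    result.sort();
    result.truncate(n);
    result
}
```

With this filter in front, the table sees at most n distinct keys per block instead of every key in the block, which is what limits its growth.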

🐳 :) explain pipeline SELECT UserID, SearchPhrase, COUNT(*) FROM hits GROUP BY UserID, SearchPhrase ORDER BY UserID, SearchPhrase LIMIT 10;
-[ EXPLAIN ]-----------------------------------
CompoundBlockOperator(Project) × 1
  LimitTransform × 1
    Merge to MultiSortMerge × 1
      TransformSortSpill × 16
        TransformSortMergeLimit × 16
          SortPartialTransform × 16
            TransformFinalAggregate × 16
              TransformSpillReader × 16
                Merge to Resize × 16
                  Merge to TransformPartitionBucket × 1
                    TransformAggregateSpillWriter × 16
                      TransformPartialAggregate × 16
                        SortPartialTransform × 16
                          NativeDeserializeDataTransform × 16
                            SyncReadNativeDataSource × 16

On my PC, the query time drops from 2.8 sec to 0.35 sec.

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Sep 19, 2024
@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Sep 19, 2024
@dosubot dosubot bot added A-query Area: databend query C-performance Category: Performance labels Sep 19, 2024
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Sep 19, 2024
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Sep 19, 2024
@Dousir9 Dousir9 added this pull request to the merge queue Sep 19, 2024
@BohuTANG BohuTANG removed this pull request from the merge queue due to a manual request Sep 19, 2024
@BohuTANG BohuTANG merged commit 1fac99d into datafuselabs:main Sep 19, 2024
78 checks passed
Labels
A-query Area: databend query C-performance Category: Performance lgtm This PR has been approved by a maintainer pr-feature this PR introduces a new feature to the codebase size:XXL This PR changes 1000+ lines, ignoring generated files.
4 participants