Skip to content

Conversation

@JkSelf
Copy link

@JkSelf JkSelf commented Jan 16, 2026

What problem does this PR solve?

Issue Number: close #149

Type of Change

  • πŸ› Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • πŸš€ Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • πŸ”¨ Refactoring (no logic changes)
  • πŸ”§ Build/CI or Infrastructure changes
  • πŸ“ Documentation only

Description

In Spark, the hash table is constructed only once in the driver and then broadcasted to each executor. In Gluten, we replace the broadcast hash join with a hash join, which leads to performance issues when the broadcast threshold increases. This is because each task must build its own hash table when using hash join.

This PR enables HashBuild to accept pre-built HashTable, thereby bypassing the hash table construction process. Additionally, it modifies HashProbe to avoid clearing the shared hash table after use.

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

    Click to view Benchmark Results
    Paste your google-benchmark or TPC-H results here.
    Before: 10.5s
    After:   8.2s  (+20%)
    
  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:

Release Note:
- Fixed a crash in `substr` when input is null.
- optimized `group by` performance by 20%.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)

    Click to view Breaking Changes
    Breaking Changes:
    - Description of the breaking change.
    - Possible solutions or workarounds.
    - Any other relevant information.
    

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Ke Jia seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@frankobe
Copy link
Collaborator

@JkSelf Thx 4 the contribution! Can you add unit tests to cover the change

@wangxinshuo-bolt
Copy link
Collaborator

@JkSelf Can you explain this benchmark in more detail? For example, what is the test dataset and what is the scale factor? Is it a gain for a single query or an overall gain? Also, please sign the Contributor License, thank you!

Paste your google-benchmark or TPC-H results here.
Before: 10.5s
After:   8.2s  (+20%)

* --------------------------------------------------------------------------
*/

#include <boost/sort/pdqsort/pdqsort.hpp>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why move boost headers to first?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why move boost headers to first?

It may be clang-format that sorted the header files.


const bool nullAware_;

void* reusedHashTableAddress_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will made HashJoinNode cannot be serialized acrossβ€Œ process and machines. So better to add some comment or CHECK

std::unique_ptr<
exec::BaseHashTable,
std::function<void(exec::BaseHashTable*)>>
hashTable(nullptr, [](exec::BaseHashTable* ptr) { /* Do nothing */ });
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a litter hacky


// True if this is a build side of an anti or left semi project join and has
// at least one entry with null join keys.
bool joinHasNullKeys_{false};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool joinHasNullKeys_{false};
std::atomic<bool> joinHasNullKeys_{false};

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Enable the hash join to accept a pre-built hash table for joining

6 participants