[feat] Enable the hash join to accept a pre-built hash table for joining #150

JkSelf · 2026-01-16T10:45:08Z

What problem does this PR solve?

Issue Number: close #149

Type of Change

🐛 Bug fix (non-breaking change which fixes an issue)
✨ New feature (non-breaking change which adds functionality)
🚀 Performance improvement (optimization)
⚠️ Breaking change (fix or feature that would cause existing functionality to change)
🔨 Refactoring (no logic changes)
🔧 Build/CI or Infrastructure changes
📝 Documentation only

Description

In Spark, the hash table for a broadcast hash join is constructed only once in the driver and then broadcast to each executor. In Gluten, we replace the broadcast hash join with a regular hash join, so each task must build its own hash table. This causes performance issues as the broadcast threshold increases, because every task repeats the build.

This PR enables HashBuild to accept a pre-built HashTable, thereby bypassing hash table construction. It also modifies HashProbe so that it does not clear the shared hash table after use.
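
A minimal sketch of the idea described above, under stated assumptions: `JoinTable`, `BuildSketch`, and `ProbeSketch` are hypothetical stand-ins, not Bolt's actual `HashBuild`, `HashProbe`, or `BaseHashTable`. It shows a build step that skips construction when a pre-built table is supplied, and a probe step that releases only its own reference instead of clearing the shared table.

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the engine's hash table (key -> payload rows).
using JoinTable = std::unordered_multimap<int64_t, int64_t>;
using Rows = std::vector<std::pair<int64_t, int64_t>>;

// Build side: adopts a shared pre-built table, or builds a local one.
class BuildSketch {
 public:
  explicit BuildSketch(std::shared_ptr<const JoinTable> prebuilt = nullptr)
      : prebuilt_(std::move(prebuilt)) {}

  // Returns false when the input is skipped because a table already exists.
  bool addInput(const Rows& rows) {
    if (prebuilt_ != nullptr) {
      return false;
    }
    if (local_ == nullptr) {
      local_ = std::make_shared<JoinTable>();
    }
    for (const auto& row : rows) {
      local_->emplace(row.first, row.second);
    }
    return true;
  }

  // The table handed to the probe side; may be shared with other tasks.
  std::shared_ptr<const JoinTable> table() const {
    if (prebuilt_ != nullptr) {
      return prebuilt_;
    }
    return local_;
  }

 private:
  std::shared_ptr<const JoinTable> prebuilt_;
  std::shared_ptr<JoinTable> local_;
};

// Probe side: read-only access, so several probes can share one table.
class ProbeSketch {
 public:
  explicit ProbeSketch(std::shared_ptr<const JoinTable> table)
      : table_(std::move(table)) {}

  std::vector<int64_t> probe(int64_t key) const {
    std::vector<int64_t> matches;
    const auto range = table_->equal_range(key);
    for (auto it = range.first; it != range.second; ++it) {
      matches.push_back(it->second);
    }
    return matches;
  }

  // Drops this probe's reference only; the table contents stay intact.
  void close() { table_.reset(); }

 private:
  std::shared_ptr<const JoinTable> table_;
};
```

The sketch uses `std::shared_ptr<const JoinTable>` so the last user frees the table. The PR itself passes a raw address through the plan node, so lifetime management there is an open design question rather than something this sketch settles.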

Performance Impact

No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

Positive Impact: I have run benchmarks.

Benchmark Results

Paste your google-benchmark or TPC-H results here.
Before: 10.5s
After:   8.2s  (+20%)

Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:
- Fixed a crash in `substr` when input is null.
- Optimized `group by` performance by 20%.

Checklist (For Author)

I have added/updated unit tests (ctest).
I have verified the code with local build (Release/Debug).
I have run clang-format / linters.
(Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
No need to test or manual test.

Breaking Changes

No

Yes (Description: ...)

Breaking Changes:
- Description of the breaking change.
- Possible solutions or workarounds.
- Any other relevant information.

CLAassistant · 2026-01-16T10:45:14Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Ke Jia seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

frankobe · 2026-01-16T20:02:00Z

@JkSelf Thanks for the contribution! Can you add unit tests to cover the change?

wangxinshuo-bolt · 2026-01-19T03:05:14Z

@JkSelf Can you explain this benchmark in more detail? For example, what is the test dataset and what is the scale factor? Is it a gain for a single query or an overall gain? Also, please sign the Contributor License, thank you!

Paste your google-benchmark or TPC-H results here.
Before: 10.5s
After:   8.2s  (+20%)

yangzhg · 2026-01-19T08:23:59Z

bolt/exec/HashTable.cpp

 * --------------------------------------------------------------------------
 */

+#include <boost/sort/pdqsort/pdqsort.hpp>


Why move the boost headers to the top?

XinShuoWang · 2026-01-19

clang-format may have sorted the header files.

yangzhg · 2026-01-19T08:37:55Z

bolt/core/PlanNode.h


  const bool nullAware_;
+
+  void* reusedHashTableAddress_;


This makes HashJoinNode impossible to serialize across processes and machines, so it would be better to add a comment or a CHECK.
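
One way the suggested guard could look, as a sketch only: `JoinNodeSketch` and its `serialize()` format are hypothetical, and a thrown `std::logic_error` stands in for whatever CHECK macro the codebase uses. The point is that a raw address is meaningful only inside the process that produced it, so serialization refuses nodes that carry one.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical, simplified plan node; the real HashJoinNode has more fields.
class JoinNodeSketch {
 public:
  explicit JoinNodeSketch(bool nullAware, void* reusedHashTableAddress = nullptr)
      : nullAware_(nullAware),
        reusedHashTableAddress_(reusedHashTableAddress) {}

  // Fails loudly instead of silently writing out a process-local pointer.
  std::string serialize() const {
    if (reusedHashTableAddress_ != nullptr) {
      throw std::logic_error(
          "node with a reused hash table address cannot be serialized");
    }
    return std::string("{\"nullAware\":") + (nullAware_ ? "true" : "false") +
        "}";
  }

 private:
  const bool nullAware_;

  // Process-local address of a pre-built hash table. Not serializable:
  // it is invalid in any other process or on any other machine.
  void* const reusedHashTableAddress_;
};
```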

yangzhg · 2026-01-19T08:38:47Z

bolt/exec/HashBuild.cpp

+      std::unique_ptr<
+          exec::BaseHashTable,
+          std::function<void(exec::BaseHashTable*)>>
+          hashTable(nullptr, [](exec::BaseHashTable* ptr) { /* Do nothing */ });


A little hacky.
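
A possible alternative to the inline no-op lambda, sketched under assumptions: `TableSketch`, `OptionalDelete`, `makeOwned`, and `borrow` are hypothetical names, not part of the PR. A small named deleter records whether the pointer owns the table, which makes the borrowing case explicit at the call site and avoids the `std::function` wrapper.

```cpp
#include <memory>

// Hypothetical table type that tracks how many instances are alive.
struct TableSketch {
  explicit TableSketch(int* liveCount) : liveCount_(liveCount) {
    ++*liveCount_;
  }
  ~TableSketch() { --*liveCount_; }
  int* liveCount_;
};

// Deleter that frees the table only when this pointer owns it.
struct OptionalDelete {
  explicit OptionalDelete(bool ownsTable = true) : owns(ownsTable) {}
  void operator()(TableSketch* table) const {
    if (owns) {
      delete table;
    }
  }
  bool owns;
};

using TablePtr = std::unique_ptr<TableSketch, OptionalDelete>;

// Owning pointer: the table is destroyed with the pointer.
TablePtr makeOwned(int* liveCount) {
  return TablePtr(new TableSketch(liveCount), OptionalDelete(true));
}

// Borrowing pointer: wraps a shared table without taking ownership.
TablePtr borrow(TableSketch* shared) {
  return TablePtr(shared, OptionalDelete(false));
}
```

Both variants share one pointer type, so code that receives a `TablePtr` does not need to know whether the table was built locally or reused.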

yangzhg · 2026-01-19T08:42:44Z

bolt/exec/HashTable.h

+
+  // True if this is a build side of an anti or left semi project join and has
+  // at least one entry with null join keys.
+  bool joinHasNullKeys_{false};


Suggested change

-  bool joinHasNullKeys_{false};
+  std::atomic<bool> joinHasNullKeys_{false};
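
A small illustration of why the atomic type matters once a table is shared, assuming (this is not confirmed by the PR) that several threads may set the flag on the same table. `SharedTableFlags` and `markFromThreads` are hypothetical names. With a plain `bool`, unsynchronized concurrent writes would be a data race; with `std::atomic<bool>`, each store and load is well defined.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical holder for the flag from the suggested change.
struct SharedTableFlags {
  std::atomic<bool> joinHasNullKeys_{false};
};

// Simulates several threads that each may have seen a null join key and
// record that fact on the same shared table.
bool markFromThreads(SharedTableFlags& flags, const std::vector<bool>& sawNull) {
  std::vector<std::thread> workers;
  for (bool saw : sawNull) {
    workers.emplace_back([&flags, saw] {
      if (saw) {
        flags.joinHasNullKeys_.store(true);
      }
    });
  }
  for (auto& worker : workers) {
    worker.join();
  }
  return flags.joinHasNullKeys_.load();
}
```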

Add BoradcastHashJoin optimization

4b21891

frankobe requested review from guhaiyan0221 and wangxinshuo-bolt January 16, 2026 19:58

yangzhg reviewed Jan 19, 2026

XinShuoWang Jan 19, 2026
