[Optimization] Recluster RowContainer in HashJoin Array Mode to improve cache locality #52
Conversation
```cpp
if (typeKind == TypeKind::ROW || typeKind == TypeKind::ARRAY ||
    typeKind == TypeKind::MAP) {
  auto sourceView = valueAt<std::string_view>(sourceRow, col.offset());
  if (!sourceView.empty()) {
    RowSizeTracker tracker(
        targetRow[rowSizeOffset_], targetStringAllocator);
    targetStringAllocator.copyMultipart(
        StringView(sourceView.data(), sourceView.size()),
        targetRow,
        col.offset());
  }
} else if (
    typeKind == TypeKind::VARCHAR || typeKind == TypeKind::VARBINARY) {
  StringView sourceView = valueAt<StringView>(sourceRow, col.offset());
  if (!sourceView.isInline()) {
    RowSizeTracker tracker(
        targetRow[rowSizeOffset_], targetStringAllocator);
    targetStringAllocator.copyMultipart(
        sourceView, targetRow, col.offset());
  }
}
```
No need to copy very large strings? Their internal memory may also be discontiguous.
I'm also considering skipping the reordering/copying for large string keys.
For large strings, the performance bottleneck during the probe phase shifts from row locality to memory bandwidth and the cost of string comparison. Even if we physically reorder the rows, loading large strings into the cache effectively 'pollutes' the cache lines anyway, offering diminishing returns on locality. Therefore, the cost of copying these huge strings during the build phase likely outweighs the minimal gain in the probe phase.
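For illustration, such a cutoff could be expressed as a size-based guard evaluated before deciding to recluster at all. Everything below (the helper name, signature, and threshold) is a hypothetical sketch, not code from this PR.

```cpp
#include <cstdint>

// Hypothetical guard (not part of this PR): skip reclustering when the join
// keys carry large out-of-line strings. For such keys the probe phase is
// dominated by memory bandwidth and string comparisons, so physically
// reordering rows buys little while the build-time copy stays expensive.
// kMaxAvgKeyStringBytes is an illustrative threshold, not a tuned value.
constexpr int64_t kMaxAvgKeyStringBytes = 256;

inline bool shouldSkipReclusterForLargeStrings(
    int64_t totalKeyStringBytes,
    int64_t numBuildRows) {
  return numBuildRows > 0 &&
      totalKeyStringBytes / numBuildRows > kMaxAvgKeyStringBytes;
}
```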
```diff
 if (adaptive_ && allowPreload_.load()) {
   allowPreload_ = false;
-  LOG(WARNING) << "Disallow scan preload due to limited memory";
+  LOG(INFO) << "Disallow scan preload due to limited memory";
```
Disallowing preload is unexpected, so I'd prefer a WARNING message.
This log is emitted very frequently when running queries such as TPC-DS from the CLI, and WARNING causes significant noise in the console (especially in spark-sql/shell).
I switched it to INFO to keep it out of the console output while ensuring it is still recorded in the log files.
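If keeping WARNING severity mattered, a rate-limited log would be another option. The sketch below assumes the standard glog macros LOG_FIRST_N / LOG_EVERY_N are available; it is only a suggestion, not part of this patch.

```cpp
#include <atomic>
#include <glog/logging.h>

// Sketch only: keep WARNING severity but cap how often it reaches the console.
void disallowPreloadIfNeeded(bool adaptive, std::atomic<bool>& allowPreload) {
  if (adaptive && allowPreload.load()) {
    allowPreload = false;
    // Warn on the first occurrence only; later occurrences are suppressed.
    LOG_FIRST_N(WARNING, 1) << "Disallow scan preload due to limited memory";
    // Alternatively, LOG_EVERY_N(WARNING, 1000) would warn once per 1000 hits.
  }
}
```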
```cpp
} else if (
    reclusterConfig_.reclusterMode ==
    HashTableReclusterConfig::ReclusterMode::kHash) {
```
try to reduce memory cost?
No, the goal isn't to save memory. Both methods are designed to improve cache locality during the probe phase by physically reordering the RowContainer.
The sort-based method guarantees that identical keys are stored contiguously, but it introduces significant overhead, roughly 7x slower than the hash-based method.
The hash-based method is a trade-off: while it doesn't guarantee strict contiguity like sorting, it effectively clusters identical keys with much lower computational overhead, providing a better balance between build time and probe performance.
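Conceptually the two modes differ only in how the build rows are ordered before being copied out. This standalone sketch uses simplified stand-in types (a plain uint64_t key and a vector of row pointers) rather than the real HashTable/RowContainer interfaces.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Simplified stand-in for a row stored in the RowContainer.
struct Row {
  uint64_t key;
};

// Sort-based mode: identical keys become strictly contiguous, at the cost of
// a full O(n log n) sort over all build rows.
void orderBySort(std::vector<Row*>& rows) {
  std::sort(rows.begin(), rows.end(), [](const Row* a, const Row* b) {
    return a->key < b->key;
  });
}

// Hash-based mode: bucket rows by key hash, then concatenate the buckets.
// Rows sharing a key land in the same bucket (hence near each other after the
// rewrite), without a total order and with cheaper, mostly sequential passes.
void orderByHash(std::vector<Row*>& rows, size_t numBuckets) {
  std::vector<std::vector<Row*>> buckets(numBuckets); // numBuckets must be > 0
  for (Row* row : rows) {
    buckets[std::hash<uint64_t>{}(row->key) % numBuckets].push_back(row);
  }
  size_t out = 0;
  for (auto& bucket : buckets) {
    for (Row* row : bucket) {
      rows[out++] = row;
    }
  }
}
```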
I mean the memory cost here needs to be reduced.
In-place reclustering is much slower than copying the clustered data into a new container and then replacing the old RowContainer.
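The copy-then-replace pattern described above, reduced to a toy example: RowContainerLike and cloneRowsInOrder are hypothetical stand-ins for illustration, not the actual RowContainer::cloneByOrder implementation.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Toy stand-in for a row container; the real one stores variable-width rows
// plus out-of-line string data.
struct RowContainerLike {
  std::vector<int64_t> rows;
};

// Copy-then-replace: append rows to a fresh container in clustered order and
// swap it in afterwards. An in-place permutation would instead need random
// writes and cycle chasing inside the existing container, which is what makes
// it slower here despite avoiding the temporary second copy.
std::unique_ptr<RowContainerLike> cloneRowsInOrder(
    const RowContainerLike& source,
    const std::vector<size_t>& order) {
  auto target = std::make_unique<RowContainerLike>();
  target->rows.reserve(order.size());
  for (size_t i : order) {
    target->rows.push_back(source.rows[i]);
  }
  return target; // caller replaces the old container with this one
}
```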
bolt/exec/HashTable.cpp (Outdated)

```cpp
    return;
  }

  if (!reclusterConfig_.enableArrayRecluster) {
```
move this line to the beginning of this function?
done
What problem does this PR solve?
Issue Number: close #51
Type of Change
Description
Summary:
This PR introduces an optimization for Hash Join in Array mode. When the build side has low cardinality but a high number of rows (high duplication factor), the linked lists in the hash table often point to scattered memory locations in the RowContainer. This causes severe cache thrashing during the probe phase.
This change adds a tryRecluster step in HashTable::prepareJoinTable. If the duplication ratio exceeds a threshold, it creates a new, sorted RowContainer where rows with the same key are stored contiguously.
Implementation Details:
Added QueryConfig for enable_hash_join_array_recluster.
Implemented reclusterDataByKey in HashTable, supporting both hash-based and sort-based recluster modes.
Implemented RowContainer::cloneByOrder to deep copy rows into a new container based on the sorted order.
Added logic to estimate the probe-side row count to decide whether the optimization cost is justified (see the sketch after this list).
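A rough sketch of how that cost check could be expressed. The field names and thresholds below are illustrative assumptions, not the actual query config keys or tuned values from this PR (only enable_hash_join_array_recluster is named in the change).

```cpp
#include <cstdint>

// Hypothetical inputs to the recluster decision, mirroring the description
// above. Thresholds are placeholders, not the values used by the PR.
struct ReclusterDecisionInput {
  bool enableArrayRecluster;    // from enable_hash_join_array_recluster
  uint64_t buildRows;           // rows in the build-side RowContainer
  uint64_t distinctKeys;        // distinct join keys on the build side
  uint64_t estimatedProbeRows;  // estimated probe-side row count
};

bool shouldRecluster(const ReclusterDecisionInput& in) {
  if (!in.enableArrayRecluster || in.distinctKeys == 0) {
    return false;
  }
  // High duplication: each key repeats many times, so probe lookups would
  // otherwise chase long chains of scattered rows.
  const double duplicationRatio =
      static_cast<double>(in.buildRows) / in.distinctKeys;
  // The probe side must be large enough to amortize the one-off cost of
  // deep-copying the RowContainer during the build phase.
  const bool probeLargeEnough = in.estimatedProbeRows >= 4 * in.buildRows;
  return duplicationRatio >= 8.0 && probeLargeEnough;
}
```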
Performance Impact
Performance Results (TPC-DS 100G):
Environment:
BF config:
Pre-sorted: Sort inventory by inv_item_sk
Target Query: Q72 (High duplication, low cardinality join).
For a simple query on TPC-DS 100G:
Baseline: 1711.516s
New: 178.003s
Analysis:
Cloning into a new RowContainer yields an additional ~27% speedup (630s -> 496s) because it guarantees physical compactness better than the upstream iterator order (which might be chunked).
Release Note
Please describe the changes in this PR
Release Note:
Checklist (For Author)
Breaking Changes