[WIP] Hybrid layout design for HashJoin/Sort #119
What problem does this PR solve?
Issue Number: #11
Type of Change
Description
Currently, HashJoin and Sort store their data in the row-based RowContainer, which incurs non-trivial layout conversion overhead. This PR introduces a hybrid storage design that stores only the keys in RowContainer and keeps payload columns in their original columnar format, reducing that overhead.
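To make the row reference concrete, here is a minimal sketch of how a {containerId, rowId} pair (the HybridRowId listed under Main Changes) could be packed into a single 64-bit value, and how rows could be grouped by container before payload extraction. The 32/32 bit split and the names RowRef, makeRowRef, containerOf, rowOf, allSameContainer, and extractionOrder are assumptions made for illustration; they are not taken from the PR and the actual layout may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical row reference: high 32 bits hold containerId,
// low 32 bits hold rowId. The real HybridRowId may use another layout.
using RowRef = std::uint64_t;

constexpr RowRef makeRowRef(std::uint32_t containerId, std::uint32_t rowId) {
  return (static_cast<std::uint64_t>(containerId) << 32) | rowId;
}

constexpr std::uint32_t containerOf(RowRef ref) {
  return static_cast<std::uint32_t>(ref >> 32);
}

constexpr std::uint32_t rowOf(RowRef ref) {
  return static_cast<std::uint32_t>(ref & 0xFFFFFFFFu);
}

// True when every reference points into the same container
// (the single-driver case). An empty input also counts as single.
inline bool allSameContainer(const std::vector<RowRef>& refs) {
  return std::all_of(refs.begin(), refs.end(), [&](RowRef r) {
    return containerOf(r) == containerOf(refs.front());
  });
}

// Returns the positions of `refs` ordered by containerId, so that
// payload reads from one container are adjacent. stable_sort keeps
// the original relative order of rows within a container.
inline std::vector<std::size_t> extractionOrder(
    const std::vector<RowRef>& refs) {
  std::vector<std::size_t> order(refs.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  if (allSameContainer(refs)) {
    return order;  // Fast path: nothing to reorder.
  }
  std::stable_sort(
      order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return containerOf(refs[a]) < containerOf(refs[b]);
      });
  assert(order.size() == refs.size());
  return order;
}
```

The sketch returns a permutation instead of reordering the references in place, so the caller can still write extracted payload values back to their original output positions.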
Main Changes
HybridContainer
- Separates key storage (RowContainer) from payload storage (kept as RowVectorPtr).
- Uses HybridRowId to reference payload rows.
- HybridRowId encodes {containerId, rowId} to support multi-driver parallel execution in HashJoin.
Multi-driver support in HashJoin
- The allContainers_ map enables cross-container payload extraction during the probe phase.
Extraction optimizations
- coalesceBatches() flattens multiple payload batches into a single contiguous batch to reduce TLB misses during extraction.
- sortByContainerId() reorders rows by containerId before extraction to improve cache locality in multi-container scenarios.
- isSingleContainer() provides a fast path that skips the sorting overhead in the single-driver scenario.
Configuration options
- hybrid_join_enabled / hybrid_sort_enabled to opt in to hybrid execution.
- hybrid_join_reorder_enabled to control row reordering (disabled in tests to preserve deterministic output).
Performance Impact
No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).
Positive Impact: I have run benchmarks.
Negative Impact: Explained below (e.g., trade-off for correctness).
Tested on bolt_tpch_benchmark with sf=10 and iteration=5. Performance is almost the same.
Release Note
Please describe the changes in this PR
Release Note:
Checklist (For Author)
Breaking Changes
No
Yes (Description: ...)