Improve Prefetching Primitives #38
Merged
Commits on Nov 3, 2023
Improve Prefetching Primitives
At the moment our ROF backend doesn't perform better than the regular JIT backend. We need to get to a point where the vectorized hash table lookups actually help our overall performance. This commit makes a first step in that direction by improving the hash/prefetch primitives. Until now we had one primitive that hashed, and a second primitive that prefetched based on the hash. The ROF paper's prefetching strategy is different: it fuses hashing and prefetching into a single primitive, which allows overlapping memory loads with computation and leads to better CPU utilization. We now do the same and have a fused hash/prefetch primitive.
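The contrast between the two strategies can be sketched as follows. This is an illustrative example, not inkfuse's actual primitive code: `mix_hash` is a stand-in for the real hash function, and the 64-byte slot size and function names are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 64-bit mixer standing in for the real hash function.
inline uint64_t mix_hash(uint64_t key) {
    key ^= key >> 33;
    key *= 0xff51afd7ed558ccdULL;
    key ^= key >> 33;
    return key;
}

// Separate primitives: one pass hashes, a later pass prefetches. Every hash
// has to round-trip through the hashes[] vector before any prefetch issues.
void hash_then_prefetch(const uint64_t* keys, uint64_t* hashes,
                        const char* table, uint64_t slot_mask, size_t n) {
    for (size_t i = 0; i < n; ++i)
        hashes[i] = mix_hash(keys[i]);
    for (size_t i = 0; i < n; ++i)
        __builtin_prefetch(table + (hashes[i] & slot_mask) * 64);
}

// Fused primitive in the spirit of the ROF paper: the prefetch is issued
// right after each hash, so the memory loads for earlier keys overlap with
// the hash computation of later keys.
void fused_hash_prefetch(const uint64_t* keys, uint64_t* hashes,
                         const char* table, uint64_t slot_mask, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        hashes[i] = mix_hash(keys[i]);
        __builtin_prefetch(table + (hashes[i] & slot_mask) * 64);
    }
}
```

Both variants produce identical hashes; the fused one simply gives the CPU useful work to do while the prefetched cache lines are in flight.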
Commit: cb9a38e
Allow Inlining C++ Runtime Code in Vectorized Primitives
This commit allows inlining our C++ runtime code within vectorized primitives. We use clang's link-time optimization for this: we compile a new `inkfuse_runtime_static` library with `-flto`. When generating C code for the interpreter and compiling it into a shared library, we now link the objects from `inkfuse_runtime_static` into the generated fragments. This removes the optimization boundary between generated C code and the C++ runtime that existed so far: rather than always generating call statements into the runtime system, we can now inline runtime functions directly into the generated primitives.
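A rough sketch of what such a build pipeline looks like (the exact flags and file names in the repository may differ; only the `inkfuse_runtime_static` name comes from the commit message):

```shell
# Compile the runtime ahead of time with LTO, so the object files carry
# bitcode that clang's LTO pass can inline across module boundaries.
clang++ -O3 -flto -c runtime.cpp -o runtime.o
llvm-ar rcs libinkfuse_runtime_static.a runtime.o

# At query time: compile a generated C fragment and link the static
# runtime into it. With -flto on both sides, runtime calls inside the
# fragment can be fully inlined instead of staying opaque call sites.
clang -O3 -flto -fPIC -shared fragment.c libinkfuse_runtime_static.a \
      -fuse-ld=lld -o fragment.so
```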
Commit: 2b92b11
Commit: dcfea3f
Commits on Nov 4, 2023
Commit: 5312210
This commit introduces specialized hash primitives for fixed-width keys. It is one of the biggest performance improvements we've shipped so far for the vectorized interpreter: **for the first time, vectorization outperforms compilation for some queries**. This is possible because, thanks to our new link-time optimizations, the vectorized primitives completely inline the xxhash function calls, leading to much more efficient hashing and prefetching. Numbers for the interpreter at SF5:

```
Q1:  146ms  -> 146ms
Q3:  130ms  -> 68ms
Q4:  108ms  -> 36ms
Q6:  24ms   -> 24ms
Q14: 19ms   -> 18ms
Q18: 1450ms -> 1270ms
```

Essentially any query with large hash tables now benefits substantially from the improved hashing primitives.
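The idea of specializing on key width can be illustrated with a small sketch. The hash below is a simple FNV-1a stand-in rather than the xxhash inkfuse actually uses, and the function names are hypothetical; the point is that a compile-time `Width` lets the byte loop unroll completely and the entire hash inline into the primitive's hot loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Generic path: hash arbitrary-length keys. Stands in for a call into a
// runtime hash wrapper that previously could not be inlined.
uint64_t hash_generic(const void* key, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;  // FNV-1a as a simple stand-in.
    const auto* p = static_cast<const unsigned char*>(key);
    for (size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

// Specialized path: the key width is a compile-time constant, so the byte
// loop unrolls fully and the whole hash inlines into the caller.
template <size_t Width>
uint64_t hash_fixed(const void* key) {
    uint64_t h = 0xcbf29ce484222325ULL;
    const auto* p = static_cast<const unsigned char*>(key);
    for (size_t i = 0; i < Width; ++i) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

// A vectorized primitive instantiated for one key width (8-byte keys here).
void hash_primitive_u64(const uint64_t* keys, uint64_t* out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = hash_fixed<8>(&keys[i]);
}
```

The specialized and generic paths compute identical hashes; the win comes purely from the compiler seeing a fixed trip count.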
Commit: 2c18a25
This commit refines our Hybrid backend. So far, the JIT-compiled code always performed best, so the hybrid backend wasn't very dynamic. Now that we have a much faster interpreter, dynamic switching becomes much more important. We now collect exponentially decaying averages of the tuple throughput of the different backends and use them to decide which backend to prefer. By default, we run 10% of morsels in the interpreter, 10% of morsels in the JIT-compiled code (once it's ready), and the remaining 80% in the backend with higher tuple throughput. If we notice that one backend performs much better, however, we dynamically increase the fraction of morsels running in the faster backend. In extreme cases, we run only 2.5% of morsels in the slower backend.
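The decision logic can be sketched roughly as below. All names, the decay constant, and the exploration schedule are illustrative assumptions, not inkfuse's actual implementation; the sketch only shows the two ingredients the commit describes: a decaying throughput average per backend, and a fixed exploration budget with an exploit-the-faster-backend default.

```cpp
#include <cassert>
#include <cstddef>

// Per-backend throughput tracked as an exponentially decaying average,
// so recent morsels dominate stale measurements.
struct BackendStats {
    double decayed_throughput = 0.0;
    void observe(double tuples_per_sec, double alpha = 0.2) {
        decayed_throughput =
            alpha * tuples_per_sec + (1.0 - alpha) * decayed_throughput;
    }
};

enum class Backend { Interpreter, Compiled };

// Pick a backend for the next morsel: force each backend to run a fixed
// fraction of morsels (exploration), otherwise run the one whose decayed
// throughput is currently higher (exploitation).
Backend pick_backend(const BackendStats& interp, const BackendStats& jit,
                     size_t morsel_idx, double explore_fraction = 0.1) {
    const size_t period = static_cast<size_t>(1.0 / explore_fraction);
    const size_t slot = morsel_idx % (2 * period);
    if (slot == 0) return Backend::Interpreter;  // forced exploration
    if (slot == period) return Backend::Compiled;  // forced exploration
    return interp.decayed_throughput >= jit.decayed_throughput
               ? Backend::Interpreter
               : Backend::Compiled;
}
```

Shrinking `explore_fraction` when one backend dominates corresponds to the "down to 2.5% of morsels" behavior described above.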
Commit: fa13c95
While full LTO during JIT compilation is too expensive to be worth it at moderate scale factors (<=100GB), we can still apply link-time optimization to the PackDB binary itself. This commit enables link-time optimization in PackDB, specifically across the compilation boundary with xxhash. The runtime functions in the regular PackDB binary can now inline the xxhash computation as well, significantly improving performance.
Commit: 95675f1
Commit: 7988370