Improve Prefetching Primitives #38

Merged: 8 commits, Nov 4, 2023

Commits on Nov 3, 2023

  1. Improve Prefetching Primitives

    At the moment our ROF backend doesn't perform better than the regular
    JIT backend. We need to get to a point where the vectorized hash table
    lookups actually help our overall performance. This commit takes a
    first step in that direction:
    
    Improve hash/prefetch primitives:
       Until now we had one primitive that would hash, and a separate
       primitive that would prefetch based on the hash. The ROF paper's
       prefetching strategy is different: it fuses hashing and
       prefetching into a single primitive. This allows overlapping
       memory loads with computation, leading to better CPU utilization.
       We now do the same and have a fused hash/prefetch primitive, as
       sketched below.
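
    A minimal sketch of such a fused primitive (the names, slot layout,
    and hash function are illustrative stand-ins, not the actual inkfuse
    code):

    ```
    #include <cstddef>
    #include <cstdint>

    // Illustrative hash-table slot; the real table layout differs.
    struct Slot {
       uint64_t key;
       uint64_t payload;
    };

    // Stand-in 64-bit mixer (murmur-style finalizer), not the real hash.
    inline uint64_t hashKey(uint64_t k) {
       k ^= k >> 33;
       k *= 0xff51afd7ed558ccdULL;
       k ^= k >> 33;
       return k;
    }

    // Fused primitive: compute the hash and immediately prefetch the slot
    // it points at. The prefetch for row i overlaps with hashing rows
    // i+1, i+2, ..., so the slot's cache line is (hopefully) resident by
    // the time a later probe primitive touches it.
    void hashAndPrefetch(const uint64_t* keys, uint64_t* hashes, size_t n,
                         const Slot* slots, uint64_t mask) {
       for (size_t i = 0; i < n; ++i) {
          const uint64_t h = hashKey(keys[i]);
          hashes[i] = h;
          __builtin_prefetch(&slots[h & mask]);
       }
    }
    ```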
    wagjamin committed Nov 3, 2023 (commit cb9a38e)
  2. Allow Inlining C++ Runtime code in Vectorized Primitives

    This commit allows inlining our C++ runtime code within vectorized
    primitives.
    
    We use clang's link-time optimization (LTO) for this: we compile a new
    `inkfuse_runtime_static` library with `-flto`.
    
    When generating our C code for the interpreter and compiling it to a
    shared library, we now link the objects from the
    `inkfuse_runtime_static` library into the generated fragments.
    
    This effectively removes the optimization boundary between the
    generated C code and the C++ runtime that existed so far. Up until
    now we would always generate opaque call statements into the runtime
    system; now we can effectively inline functions from the runtime into
    the generated primitives, as sketched below.
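
    A compilable sketch of the boundary this removes (hypothetical names,
    not the actual inkfuse sources):

    ```
    #include <cstddef>
    #include <cstdint>

    // "Runtime" side: would live in inkfuse_runtime_static. Built with
    // -flto, its object files carry bitcode rather than plain machine code.
    extern "C" uint64_t rt_ht_lookup(const uint64_t* table, uint64_t mask,
                                     uint64_t hash) {
       return table[hash & mask];
    }

    // "Generated fragment" side: stands in for the C code emitted for the
    // interpreter. Previously this loop could only emit an opaque call into
    // the runtime; with the runtime's LTO objects linked into the fragment's
    // shared library, the optimizer can inline rt_ht_lookup at the call site.
    void primitiveProbe(const uint64_t* table, uint64_t mask,
                        const uint64_t* hashes, uint64_t* out, size_t n) {
       for (size_t i = 0; i < n; ++i) {
          out[i] = rt_ht_lookup(table, mask, hashes[i]);
       }
    }
    ```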
    wagjamin committed Nov 3, 2023 (commit 2b92b11)
  3. dcfea3f

Commits on Nov 4, 2023

  1. 5312210
  2. Specialized Hash Primitives

    This commit introduces specialized hash primitives for fixed-width keys.
    This is one of the biggest performance improvements we've shipped so far
    for the vectorized interpreter.
    
    We now **for the first time manage to have vectorization outperform
    compilation for some queries**. This is possible because, thanks to
    our new link-time optimizations, the vectorized primitives completely
    inline the xxhash function calls. This leads to much more efficient
    hashing and prefetching.
    
    Numbers for the interpreter at scale factor 5 (before -> after):
    ```
    Q1:   146ms ->  146ms
    Q3:   130ms ->   68ms
    Q4:   108ms ->   36ms
    Q6:    24ms ->   24ms
    Q14:   19ms ->   18ms
    Q18: 1450ms -> 1270ms
    ```
    
    Essentially any query that builds large hash tables now benefits
    significantly from the improved hashing primitives; the sketch below
    illustrates the specialization.
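
    The idea behind the specialization, as a sketch (the primitive names
    are illustrative; `XXH3_64bits` is the actual xxHash entry point):

    ```
    #include <cstddef>
    #include <cstdint>
    #include <xxhash.h>

    // Generic primitive: the key width is a runtime parameter, so every
    // call hashes an arbitrary-length buffer.
    void hashKeysGeneric(const char* keys, size_t width, uint64_t* out,
                         size_t n) {
       for (size_t i = 0; i < n; ++i) {
          out[i] = XXH3_64bits(keys + i * width, width);
       }
    }

    // Specialized primitive: the width is a compile-time constant.
    // Combined with the LTO-inlined xxhash, the compiler can emit a
    // fixed-size hash kernel without any length dispatch.
    template <size_t Width>
    void hashKeysFixed(const char* keys, uint64_t* out, size_t n) {
       for (size_t i = 0; i < n; ++i) {
          out[i] = XXH3_64bits(keys + i * Width, Width);
       }
    }

    // e.g. instantiate for 8-byte join keys:
    template void hashKeysFixed<8>(const char*, uint64_t*, size_t);
    ```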
    wagjamin committed Nov 4, 2023 (commit 2c18a25)
  3. Refine Hybrid Backend

    This commit refines our hybrid backend. So far we were always in a
    position where the JIT-compiled code performed best. As a result, the
    hybrid backend wasn't very dynamic.

    Now that we have a much faster interpreter, dynamic switching becomes
    much more important.

    We now collect exponentially decaying averages of the tuple throughput
    of the different backends and use them to decide which backend to
    prefer.

    By default, we try 10% of morsels in the interpreter, 10% of morsels
    in the JIT-compiled code (once it's ready), and run the remaining 80%
    in the backend with the higher tuple throughput.
    If we notice that one backend performs much better, however, we
    dynamically increase the fraction of morsels we run in the faster
    backend. In extreme cases, only 2.5% of morsels run in the slower
    backend. A sketch of this arbitration scheme follows.
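
    A sketch of the arbitration (the names, decay factor, and trial
    fractions are assumptions; the dynamic shrinking towards 2.5% is
    omitted for brevity):

    ```
    #include <cstdint>

    // Exponentially decaying average of a backend's tuple throughput.
    struct BackendStats {
       double ewmaThroughput = 0.0; // tuples/second, exponentially decayed

       void record(double tuples, double seconds, double alpha = 0.25) {
          const double observed = tuples / seconds;
          ewmaThroughput = alpha * observed + (1.0 - alpha) * ewmaThroughput;
       }
    };

    enum class Backend { Interpreter, Jit };

    // Pick the backend for the next morsel: reserve a trial share for
    // each backend so both averages stay fresh, and send the rest to
    // whichever backend is currently faster.
    Backend pickBackend(uint64_t morselId, const BackendStats& interp,
                        const BackendStats& jit, bool jitReady) {
       if (!jitReady) {
          return Backend::Interpreter;
       }
       const uint64_t bucket = morselId % 10;
       if (bucket == 0) return Backend::Interpreter; // ~10% interpreter trials
       if (bucket == 1) return Backend::Jit;         // ~10% JIT trials
       return interp.ewmaThroughput > jit.ewmaThroughput
                  ? Backend::Interpreter
                  : Backend::Jit;
    }
    ```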
    wagjamin committed Nov 4, 2023 (commit fa13c95)
  4. Improve JIT Performance

    While full LTO during JIT compilation is too expensive to be
    worthwhile at moderate scale factors (<=100GB), we can apply
    link-time optimization to the PackDB binary itself.

    This commit enables link-time optimization in PackDB, specifically
    across the compilation boundary with xxhash. The runtime functions in
    the regular PackDB binary can now also inline the xxhash computation,
    significantly improving performance.
    wagjamin committed Nov 4, 2023 (commit 95675f1)
  5. 7988370