
Improve Prefetching Primitives #38

Merged · 8 commits merged into main on Nov 4, 2023
Conversation

@wagjamin (Owner) commented Nov 3, 2023

At the moment our ROF backend doesn't perform better than the regular JIT backend. We need to get to a point where the vectorized hash table lookups actually help our overall performance. This PR takes several steps in this direction:

Improve Hash/Prefetch Primitives

Until now we had a single primitive that would hash, and then another
primitive that would prefetch based on the hash. Looking at the ROF
paper, their prefetching strategy is actually different: they fuse
hashing and prefetching into a single primitive. This allows
overlapping memory loads with computation, leading to better CPU
utilization. We now do the same and have a fused hash/prefetch
primitive.

Allow Inlining C++ Runtime Code in Vectorized Primitives

We use clang's link-time optimization (LTO) for this: we compile a new
`inkfuse_runtime_static` library with `-flto`.

When generating our C code for the interpreter and compiling it to a
shared library, we now link the objects from the
`inkfuse_runtime_static` library into the generated fragments.

This effectively removes the optimization boundary between the generated
C code and the precompiled C++ runtime. Up until now, we would always
generate call statements into the runtime system.
We can now effectively inline functions from our runtime system into the
generated primitives.
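
As a rough sketch of the boundary this removes (the function names and build commands below are illustrative, not the actual inkfuse setup):

```cpp
#include <cstdint>

// Runtime side, compiled once ahead of time into the static library:
//
//    clang++ -O3 -flto -c runtime.cpp -o runtime.o
//
// With -flto, runtime.o holds LLVM bitcode rather than finished machine
// code, so it can still be optimized at link time.
extern "C" uint64_t rt_hash_u64(uint64_t key) {
   // Stand-in for a real runtime hash helper.
   key ^= key >> 33;
   key *= 0xff51afd7ed558ccdULL;
   return key ^ (key >> 33);
}

// Generated-fragment side: the interpreter emits C code like the loop
// below and compiles it to a shared library, now linking the runtime
// objects in:
//
//    clang -O3 -fPIC -shared fragment.c runtime.o -o fragment.so
//
// Without LTO, rt_hash_u64 stays an opaque call into the precompiled
// runtime. With the bitcode objects linked in, the link-time optimizer
// can inline it straight into the hot loop.
extern "C" void fragment_hash(const uint64_t* keys, uint64_t* hashes,
                              uint64_t n) {
   for (uint64_t i = 0; i < n; ++i) {
      hashes[i] = rt_hash_u64(keys[i]);
   }
}
```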

Allow Inlining XXHASH Calls in Vectorized Prefetch Primitives

We add specialized hash primitives for fixed-width keys.
This is one of the biggest performance improvements we've shipped so far
for the vectorized interpreter.

We now **for the first time manage to have vectorization outperform
compilation for some queries**. This is possible because, due to our new
link-time optimizations, the vectorized primitives completely inline the
xxhash function calls. This leads to much more efficient hashing and
prefetching.
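
A sketch of what such a specialization can look like. `XXH_INLINE_ALL` and `XXH3_64bits` are real xxhash API; the primitive names and signatures are made up for the example:

```cpp
#include <cstddef>
#include <cstdint>

// XXH_INLINE_ALL makes the xxhash implementation header-only and
// static inline, so the compiler can specialize it per call site.
#define XXH_INLINE_ALL
#include "xxhash.h"

// Generic primitive: the key width is a runtime value, so every call
// has to go through xxhash's generic length dispatch.
void hash_varwidth(const char* keys, size_t key_width,
                   uint64_t* hashes, size_t n) {
   for (size_t i = 0; i < n; ++i) {
      hashes[i] = XXH3_64bits(keys + i * key_width, key_width);
   }
}

// Specialized primitive for 8-byte keys: the length is a compile-time
// constant, so the compiler can collapse XXH3_64bits to its
// short-input fast path and inline it completely into the loop.
void hash_u64(const uint64_t* keys, uint64_t* hashes, size_t n) {
   for (size_t i = 0; i < n; ++i) {
      hashes[i] = XXH3_64bits(&keys[i], sizeof(uint64_t));
   }
}
```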

Improve the Hybrid Backend

We refine our Hybrid backend. So far, we were always in a
position where the JIT-compiled code performed best. As a result, the
hybrid backend wasn't very dynamic.

As we now have a much faster interpreter, dynamic switching becomes much
more important.

We now collect exponentially decaying averages of the tuple throughput
of the different backends and use them to decide which backend to prefer.

By default, we try 10% of morsels in the Interpreter, 10% of morsels in
the JIT-compiled code (once it's ready), and the remaining 80% are run
in the backend with the higher tuple throughput.
If we notice that one backend performs much better, however, we dynamically
increase the fraction of morsels we run in the faster backend.
In extreme cases, we only run 2.5% of morsels in the slower backend.
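
A minimal sketch of this policy; the decay constant, the split, and all names are illustrative and may differ from the decision logic in the repo:

```cpp
#include <cstdint>

// Exponentially decaying average of tuple throughput for one backend.
struct BackendStats {
   double throughput = 0.0;             // decayed tuples/second estimate
   static constexpr double kDecay = 0.25; // weight of the newest sample (assumed)

   void report(double tuples, double seconds) {
      const double sample = tuples / seconds;
      throughput = kDecay * sample + (1.0 - kDecay) * throughput;
   }
};

enum class Backend { Interpreted, Compiled };

// Route the next morsel: sample each backend with a fixed fraction of
// morsels and run the rest in whichever currently looks faster. The
// sampling fraction for the slower backend could be shrunk further
// (down to the 2.5% extreme described above).
Backend pick_backend(uint64_t morsel_id, const BackendStats& interp,
                     const BackendStats& jit, bool jit_ready) {
   if (!jit_ready) {
      // Compiled code isn't available yet, keep interpreting.
      return Backend::Interpreted;
   }
   const uint64_t slot = morsel_id % 10;
   if (slot == 0) return Backend::Interpreted; // ~10% exploration
   if (slot == 1) return Backend::Compiled;    // ~10% exploration
   // Remaining ~80%: exploit the backend with higher decayed throughput.
   return interp.throughput > jit.throughput ? Backend::Interpreted
                                             : Backend::Compiled;
}
```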

Numbers for the interpreter at SF5, before -> after the improved hash primitives:
```
Q1:   146ms ->  146ms
Q3:   130ms ->   68ms
Q4:   108ms ->   36ms
Q6:    24ms ->   24ms
Q14:   19ms ->   18ms
Q18: 1450ms -> 1270ms
```

Basically any query with large hash tables now benefits a lot from the
improved hashing primitives.
Allow Link-Time Optimization in the PackDB Binary

While full LTO during JIT compilation is too expensive to pay off at
moderate scale factors (<= 100GB), we can add link-time optimization
to the PackDB binary itself.

This PR enables link-time optimization in PackDB, and specifically
across the compilation boundary with xxhash. The runtime functions in
the regular PackDB binary can now also inline the xxhash computation,
significantly improving performance.
@wagjamin merged commit 8a44c3a into main on Nov 4, 2023
2 checks passed