Improve Prefetching Primitives #38

Merged: 8 commits, Nov 4, 2023

Commits on Nov 3, 2023

  1. Improve Prefetching Primitives

    At the moment our ROF backend doesn't perform better than the regular
    JIT backend. We need to get to a point where the vectorized hash table
    lookups actually help our overall performance. This commit takes a
    first step in that direction:
    
    Improve hash/prefetch primitives:
       Until now we had one primitive that would hash, and a separate
       primitive that would prefetch based on the hash. The ROF paper's
       prefetching strategy is different: it fuses hashing and
       prefetching into a single primitive. This allows overlapping
       memory loads with computation, leading to better CPU utilization.
       We now do the same and have a fused hash/prefetch primitive, as
       sketched below.
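
    A minimal sketch of such a fused primitive (the names, slot layout,
    and hash function are illustrative stand-ins, not the actual inkfuse
    code):

    ```
    #include <cstddef>
    #include <cstdint>

    // Illustrative hash-table slot; the real table layout differs.
    struct Slot {
       uint64_t key;
       uint64_t payload;
    };

    // Stand-in 64-bit mixer (murmur-style finalizer), not the real hash.
    inline uint64_t hashKey(uint64_t k) {
       k ^= k >> 33;
       k *= 0xff51afd7ed558ccdULL;
       k ^= k >> 33;
       return k;
    }

    // Fused primitive: compute the hash and immediately prefetch the slot
    // it points at. The prefetch for row i overlaps with hashing rows
    // i+1, i+2, ..., so the slot's cache line is (hopefully) resident by
    // the time a later probe primitive touches it.
    void hashAndPrefetch(const uint64_t* keys, uint64_t* hashes, size_t n,
                         const Slot* slots, uint64_t mask) {
       for (size_t i = 0; i < n; ++i) {
          const uint64_t h = hashKey(keys[i]);
          hashes[i] = h;
          __builtin_prefetch(&slots[h & mask]);
       }
    }
    ```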
    wagjamin committed Nov 3, 2023 (commit cb9a38e)
  2. Allow Inlining C++ Runtime code in Vectorized Primitives

    This commit allows inlining our C++ runtime code within vectorized
    primitives.
    
    We use clang's link-time optimization (LTO) for this: we compile a new
    `inkfuse_runtime_static` library with `-flto`.
    
    When generating our C code for the interpreter and compiling it to a
    shared library, we now link the objects from the
    `inkfuse_runtime_static` library into the generated fragments.
    
    This effectively removes the optimization boundary between the
    generated C code and the C++ runtime that existed so far. Up until
    now we would always generate opaque call statements into the runtime
    system; now we can effectively inline functions from the runtime into
    the generated primitives, as sketched below.
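
    A compilable sketch of the boundary this removes (hypothetical names,
    not the actual inkfuse sources):

    ```
    #include <cstddef>
    #include <cstdint>

    // "Runtime" side: would live in inkfuse_runtime_static. Built with
    // -flto, its object files carry bitcode rather than plain machine code.
    extern "C" uint64_t rt_ht_lookup(const uint64_t* table, uint64_t mask,
                                     uint64_t hash) {
       return table[hash & mask];
    }

    // "Generated fragment" side: stands in for the C code emitted for the
    // interpreter. Previously this loop could only emit an opaque call into
    // the runtime; with the runtime's LTO objects linked into the fragment's
    // shared library, the optimizer can inline rt_ht_lookup at the call site.
    void primitiveProbe(const uint64_t* table, uint64_t mask,
                        const uint64_t* hashes, uint64_t* out, size_t n) {
       for (size_t i = 0; i < n; ++i) {
          out[i] = rt_ht_lookup(table, mask, hashes[i]);
       }
    }
    ```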
    wagjamin committed Nov 3, 2023 (commit 2b92b11)
  3. dcfea3f

Commits on Nov 4, 2023

  1. 5312210
  2. Specialized Hash Primitives

    This commit introduces specialized hash primitives for fixed-width keys.
    This is one of the biggest performance improvements we've shipped so far
    for the vectorized interpreter.
    
    We now **for the first time manage to have vectorization outperform
    compilation for some queries**. This is possible because, thanks to
    our new link-time optimizations, the vectorized primitives completely
    inline the xxhash function calls. This leads to much more efficient
    hashing and prefetching.
    
    Numbers for the interpreter at scale factor 5 (before -> after):
    ```
    Q1:   146ms ->  146ms
    Q3:   130ms ->   68ms
    Q4:   108ms ->   36ms
    Q6:    24ms ->   24ms
    Q14:   19ms ->   18ms
    Q18: 1450ms -> 1270ms
    ```
    
    Essentially any query that builds large hash tables now benefits
    significantly from the improved hashing primitives; the sketch below
    illustrates the specialization.
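
    The idea behind the specialization, as a sketch (the primitive names
    are illustrative; `XXH3_64bits` is the actual xxHash entry point):

    ```
    #include <cstddef>
    #include <cstdint>
    #include <xxhash.h>

    // Generic primitive: the key width is a runtime parameter, so every
    // call hashes an arbitrary-length buffer.
    void hashKeysGeneric(const char* keys, size_t width, uint64_t* out,
                         size_t n) {
       for (size_t i = 0; i < n; ++i) {
          out[i] = XXH3_64bits(keys + i * width, width);
       }
    }

    // Specialized primitive: the width is a compile-time constant.
    // Combined with the LTO-inlined xxhash, the compiler can emit a
    // fixed-size hash kernel without any length dispatch.
    template <size_t Width>
    void hashKeysFixed(const char* keys, uint64_t* out, size_t n) {
       for (size_t i = 0; i < n; ++i) {
          out[i] = XXH3_64bits(keys + i * Width, Width);
       }
    }

    // e.g. instantiate for 8-byte join keys:
    template void hashKeysFixed<8>(const char*, uint64_t*, size_t);
    ```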
    wagjamin committed Nov 4, 2023 (commit 2c18a25)
  3. Refine Hybrid Backend

    This commit refines our hybrid backend. So far we were always in a
    position where the JIT-compiled code performed best. As a result, the
    hybrid backend wasn't very dynamic.

    Now that we have a much faster interpreter, dynamic switching becomes
    much more important.

    We now collect exponentially decaying averages of the tuple throughput
    of the different backends and use them to decide which backend to
    prefer.

    By default, we try 10% of morsels in the interpreter, 10% of morsels
    in the JIT-compiled code (once it's ready), and run the remaining 80%
    in the backend with the higher tuple throughput.
    If we notice that one backend performs much better, however, we
    dynamically increase the fraction of morsels we run in the faster
    backend. In extreme cases, only 2.5% of morsels run in the slower
    backend. A sketch of this arbitration scheme follows.
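
    A sketch of the arbitration (the names, decay factor, and trial
    fractions are assumptions; the dynamic shrinking towards 2.5% is
    omitted for brevity):

    ```
    #include <cstdint>

    // Exponentially decaying average of a backend's tuple throughput.
    struct BackendStats {
       double ewmaThroughput = 0.0; // tuples/second, exponentially decayed

       void record(double tuples, double seconds, double alpha = 0.25) {
          const double observed = tuples / seconds;
          ewmaThroughput = alpha * observed + (1.0 - alpha) * ewmaThroughput;
       }
    };

    enum class Backend { Interpreter, Jit };

    // Pick the backend for the next morsel: reserve a trial share for
    // each backend so both averages stay fresh, and send the rest to
    // whichever backend is currently faster.
    Backend pickBackend(uint64_t morselId, const BackendStats& interp,
                        const BackendStats& jit, bool jitReady) {
       if (!jitReady) {
          return Backend::Interpreter;
       }
       const uint64_t bucket = morselId % 10;
       if (bucket == 0) return Backend::Interpreter; // ~10% interpreter trials
       if (bucket == 1) return Backend::Jit;         // ~10% JIT trials
       return interp.ewmaThroughput > jit.ewmaThroughput
                  ? Backend::Interpreter
                  : Backend::Jit;
    }
    ```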
    wagjamin committed Nov 4, 2023 (commit fa13c95)
  4. Improve JIT Performance

    While full LTO during JIT compilation is too expensive to be
    worthwhile at moderate scale factors (<=100GB), we can apply
    link-time optimization to the PackDB binary itself.

    This commit enables link-time optimization in PackDB, specifically
    across the compilation boundary with xxhash. The runtime functions in
    the regular PackDB binary can now also inline the xxhash computation,
    significantly improving performance.
    wagjamin committed Nov 4, 2023 (commit 95675f1)
  5. 7988370