simd: reapply and fix split cursor advancing #175
base: master
Note, cursory benches for this on M1 / arm64 show a ~10% perf regression vs |
@seanmonstar I can take a deeper look at this before we reconsider landing it. I haven't looked into it in depth, but there could potentially be gains on AVX2 from this approach, because it avoids reprocessing an entire (last) block on a SIMD miss and lets the SWAR or byte-wise code pick up from there. Since AVX2 operates on 32-byte (256-bit) blocks, it possibly "misses" a lot on header values, short URLs, etc. So a variant of this is definitely worth considering, but I think it should be rigorously benchmarked, given that the results shared in #156 were unclear. @seanmonstar It's probably worth adding CI benches (which can be noisy, but probably not meaningfully so for these purely compute-bound flows).
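To make the "avoid reprocessing the block on a miss" idea concrete, here is a minimal portable sketch, not httparse's actual code. It assumes a hypothetical 8-byte block standing in for a 32-byte AVX2 register, an illustrative `is_value_byte` predicate, and an emulated "invalid byte" bitmask in place of a real SIMD compare plus movemask. The block matcher reports how many leading bytes were valid, so the cursor advances by that amount instead of restarting the block byte by byte.

```rust
/// Hypothetical block width; a real AVX2 path would use 32 bytes.
const BLOCK: usize = 8;

/// Illustrative predicate only, not httparse's real header-value table.
fn is_value_byte(b: u8) -> bool {
    b == b'\t' || (0x20..0x7f).contains(&b)
}

/// Emulates a SIMD compare + movemask: bit `i` is set when byte `i` is invalid.
/// Returns the number of leading valid bytes in `block` (0..=block.len()).
fn match_block(block: &[u8]) -> usize {
    let mut invalid_mask: u32 = 0;
    for (i, &b) in block.iter().enumerate() {
        if !is_value_byte(b) {
            invalid_mask |= 1 << i;
        }
    }
    // trailing_zeros() is 32 for an empty mask, so clamp to the block length.
    (invalid_mask.trailing_zeros() as usize).min(block.len())
}

/// Returns the index of the first invalid byte, or `buf.len()` if none.
fn scan_partial_advance(buf: &[u8]) -> usize {
    let mut pos = 0;
    while buf.len() - pos >= BLOCK {
        let n = match_block(&buf[pos..pos + BLOCK]);
        pos += n; // advance by the partial match, not all-or-nothing
        if n != BLOCK {
            return pos; // stopped inside the block; nothing to reprocess
        }
    }
    // Fewer than BLOCK bytes left: byte-wise fallback handles the tail.
    while pos < buf.len() && is_value_byte(buf[pos]) {
        pos += 1;
    }
    pos
}
```

For example, `scan_partial_advance(b"ab\rcdefghij")` stops at index 2 after a single block check. Whether this shape is actually faster depends on how the compiler lowers and inlines it, which is the question the thread is debating.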
Also a big factor here could be @lucab Also are all the |
Did a first pass at CI benches: #179, this PR should show a measurable delta |
@AaronO from #156 (comment):
|
7de9ddb to 2262ff1
@lucab Yes, I think that should be split out. Especially since the original PR made substantial (AVX2) perf claims, it's best if we can focus on that and remove unrelated "noise". The alt refactor should be judged on its own merits. The focus here should be demonstrating and explaining clear perf benefits. From the benchmarks (local [arm64] and CI), that is unclear.
@lucab Modern CPUs with speculative execution can, counterintuitively, be faster when "processing the same bytes twice" if both execution paths are independent. There can also be tradeoffs that affect inline-ability if you increase register usage in a hot loop, etc. I touched this code a while back so I'm not 100% sure, but I think I explored your change (if I'm distilling it correctly to returning partial matches) and concluded, based on local M1 benches at the time, that it wasn't a net positive. For example, your original PR #156 explains the improvement with:
I should double check but IIRC, minimizing the amount of times you update the bytes-cursor sounds like a plausible improvement but in practice it impacts lowering. What actually happens here is that the "stack-allocated" Long story short, we should really distill this perf improvement hypothesis to its essentials and fully test/understand that. |
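As a point of comparison for the trade-off discussed here, this is a hedged sketch of the all-or-nothing shape, where the block check only answers yes or no and the cursor is bumped once per fully valid block. On a miss, the byte-wise loop re-reads the block from its start ("processing the same bytes twice"). The names `BLOCK_LEN`, `is_token_byte`, and `block_all_valid` are hypothetical, and the 8-byte block is a stand-in for a real SIMD width; this is not httparse's implementation.

```rust
/// Hypothetical block width standing in for a SIMD register size.
const BLOCK_LEN: usize = 8;

/// Illustrative predicate only, not the parser's real lookup table.
fn is_token_byte(b: u8) -> bool {
    b == b'\t' || (0x20..0x7f).contains(&b)
}

/// All-or-nothing check: true only if every byte in the block is valid.
fn block_all_valid(block: &[u8]) -> bool {
    block.iter().all(|&b| is_token_byte(b))
}

/// Returns the index of the first invalid byte, or `buf.len()` if none.
fn scan_all_or_nothing(buf: &[u8]) -> usize {
    let mut pos = 0;
    // The cursor is only updated for whole blocks, keeping loop state minimal.
    while buf.len() - pos >= BLOCK_LEN && block_all_valid(&buf[pos..pos + BLOCK_LEN]) {
        pos += BLOCK_LEN;
    }
    // On a miss (or a short tail) the byte-wise loop restarts at the block's
    // first byte, so up to BLOCK_LEN - 1 valid bytes are examined a second time.
    while pos < buf.len() && is_token_byte(buf[pos]) {
        pos += 1;
    }
    pos
}
```

Both shapes return the same index for the same input. The difference is only in how much work is repeated and how much state the hot loop carries, which is why the thread asks for benchmarks rather than reasoning alone.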
@lucab Ok looking at the CI benches there does appear to be a beyond-noise improvement for It smells like it might boil down to inline-ability of In part it's a perf bug vs purely raising the ceiling, especially for small values |
@AaronO Ack, I'll split the For context, I wasn't really looking at perf improvements when I opened this, but it happened as a local side effect:
|
This has massive implications on the default runtime perf, improving how the code is lowered/inlined. (Falling back to SSE4.2 for a handful of bytes was wasteful). Should supersede seanmonstar#175, seanmonstar#156
This reapplies #156 with an additional commit on top in order to fix #172.