Adding rust/bufchr/memchr/prebuilt benchmark engine
#185
I think before I'm willing to expand the API and make it more complicated, there needs to be more benchmarking: that would give us concrete numbers to weigh against the API complexity of more iterators.
Hello @BurntSushi, I have started an early benchmark over there to address the question: https://github.com/Yomguithereal/memchr-iter-bench

I am new to the "benchmarking in Rust" game. I benchmarked a naive loop of repeated calls, and only the case where you are searching for a single needle so far. Also, the use case here is to report all matches, not only the first one. Some things I seem to observe: …

It should also be noted that, for now, … So I think …
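For context, the two shapes being compared here — a naive loop of repeated calls versus the crate's match iterator — look roughly like this with `memchr`'s public API (an illustrative sketch, not the code from the linked benchmark repository):

```rust
// Report every match by calling the one-shot function repeatedly at an offset.
fn positions_naive(needle: u8, haystack: &[u8]) -> Vec<usize> {
    let mut out = Vec::new();
    let mut at = 0;
    while let Some(i) = memchr::memchr(needle, &haystack[at..]) {
        out.push(at + i);
        at += i + 1;
    }
    out
}

// Report every match through the iterator already provided by the crate.
fn positions_iter(needle: u8, haystack: &[u8]) -> Vec<usize> {
    memchr::memchr_iter(needle, haystack).collect()
}
```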
The benchmarks should use `rebar`. There's already a huge pile of benchmarks in this repo that cover a broad sampling of use cases.
(PR title changed to "rust/bufchr/memchr/oneshot benchmark engine")
I have pivoted the PR to be adding a benchmark engine to the repo using `rebar`.

I built using the following command:

rebar build -e '^rust/(memchr|bufchr)/memchr/(oneshot|prebuilt|onlycount|fallback|naive)$'

So I ran the benchmarks for the following engines:

rust/bufchr/memchr/oneshot
rust/memchr/memchr/fallback
rust/memchr/memchr/naive
rust/memchr/memchr/oneshot
rust/memchr/memchr/onlycount
rust/memchr/memchr/prebuilt

Using the following command:

rebar measure -e '^rust/(memchr|bufchr)/memchr/(oneshot|prebuilt|onlycount|fallback|naive)$' | tee bench.csv

I got the following results (bench.csv):
benchmark rust/bufchr/memchr/oneshot rust/memchr/memchr/fallback rust/memchr/memchr/naive rust/memchr/memchr/oneshot rust/memchr/memchr/onlycount rust/memchr/memchr/prebuilt
--------- -------------------------- --------------------------- ------------------------ -------------------------- ---------------------------- ---------------------------
memchr/sherlock/common/huge1 3.3 GB/s (11.32x) 940.1 MB/s (41.09x) 1137.0 MB/s (33.97x) 2.1 GB/s (18.06x) 37.7 GB/s (1.00x) 2.2 GB/s (17.04x)
memchr/sherlock/common/small1 11.0 GB/s (1.33x) 3.0 GB/s (4.83x) 2.3 GB/s (6.31x) 2.3 GB/s (6.50x) 14.7 GB/s (1.00x) 2.4 GB/s (6.10x)
memchr/sherlock/common/tiny1 2.8 GB/s (1.00x) 1370.9 MB/s (2.09x) 1605.0 MB/s (1.78x) 671.5 MB/s (4.26x) 2.6 GB/s (1.09x) 901.4 MB/s (3.17x)
memchr/sherlock/never/huge1 37.6 GB/s (3.13x) 15.0 GB/s (7.85x) 2.2 GB/s (52.59x) 117.6 GB/s (1.00x) 37.7 GB/s (3.12x) 117.6 GB/s (1.00x)
memchr/sherlock/never/small1 18.7 GB/s (1.57x) 10.0 GB/s (2.95x) 2.1 GB/s (14.10x) 29.4 GB/s (1.00x) 14.7 GB/s (2.00x) 29.4 GB/s (1.00x)
memchr/sherlock/never/tiny1 3.8 GB/s (1.00x) 2.7 GB/s (1.41x) 1370.9 MB/s (2.82x) 3.8 GB/s (1.00x) 2.6 GB/s (1.47x) 3.6 GB/s (1.06x)
memchr/sherlock/never/empty1 14.00ns (1.00x) 15.00ns (1.07x) 14.00ns (1.00x) 16.00ns (1.14x) 15.00ns (1.07x) 16.00ns (1.14x)
memchr/sherlock/rare/huge1 34.2 GB/s (2.79x) 14.4 GB/s (6.63x) 2.2 GB/s (42.87x) 93.6 GB/s (1.02x) 37.7 GB/s (2.53x) 95.5 GB/s (1.00x)
memchr/sherlock/rare/small1 19.3 GB/s (1.14x) 9.1 GB/s (2.43x) 2.0 GB/s (10.79x) 22.1 GB/s (1.00x) 14.7 GB/s (1.50x) 20.6 GB/s (1.07x)
memchr/sherlock/rare/tiny1 3.6 GB/s (1.00x) 2.7 GB/s (1.33x) 1370.9 MB/s (2.67x) 2.9 GB/s (1.22x) 2.6 GB/s (1.39x) 2.9 GB/s (1.22x)
memchr/sherlock/uncommon/huge1 11.1 GB/s (3.41x) 4.7 GB/s (7.99x) 1938.9 MB/s (19.92x) 7.1 GB/s (5.29x) 37.7 GB/s (1.00x) 7.4 GB/s (5.13x)
memchr/sherlock/uncommon/small1 17.7 GB/s (1.00x) 7.5 GB/s (2.34x) 2.1 GB/s (8.49x) 9.0 GB/s (1.97x) 14.7 GB/s (1.20x) 9.5 GB/s (1.86x)
memchr/sherlock/uncommon/tiny1 3.2 GB/s (1.00x) 1994.0 MB/s (1.65x) 1687.3 MB/s (1.95x) 1265.5 MB/s (2.60x) 2.6 GB/s (1.25x) 1495.5 MB/s (2.20x)
memchr/sherlock/verycommon/huge1 2.4 GB/s (15.94x) 540.4 MB/s (71.48x) 561.8 MB/s (68.75x) 946.9 MB/s (40.79x) 37.7 GB/s (1.00x) 1028.3 MB/s (37.56x)
memchr/sherlock/verycommon/small1 7.1 GB/s (2.07x) 1688.6 MB/s (8.93x) 1840.8 MB/s (8.19x) 980.2 MB/s (15.38x) 14.7 GB/s (1.00x) 1064.3 MB/s (14.17x)

And the ranking:

Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
rust/memchr/memchr/onlycount 2.7.4 1.34 15
rust/bufchr/memchr/oneshot 0.0.1 1.97 15
rust/memchr/memchr/prebuilt 2.7.4 2.83 15
rust/memchr/memchr/oneshot 2.7.4 2.97 15
rust/memchr/memchr/fallback 2.7.4 4.49 15
rust/memchr/memchr/naive 2.7.4 8.93 15

I made sure to check that the benchmarked functions returned the same count of matches as what is found in the definition files (I need to check whether there is an integrated way to do so using `rebar`). Note that I am only using SSE2 instructions, whereas the `memchr` engines use runtime feature detection and can use AVX2.

For reference, here is my CPU:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
CPU family: 6
Model: 140
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 1
CPU max MHz: 4200.0000
CPU min MHz: 400.0000
BogoMIPS: 4838.40
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon
pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulq
dq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2a
pic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch c
puid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority e
pt vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512d
q rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl x
saveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk dtherm ida arat pln pts hwp hw
p_notify hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke avx512_vbmi2 g
fni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fs
rm avx512_vp2intersect md_clear ibt flush_l1d arch_capabilities
Nice work. Note that what you have isn't a one-shot searcher; it is prebuilt. One-shot means the search is completely reconstructed on every call.

Your benchmark results look nice. The biggest wins seem to come from benchmarks with lots of matches, which is exactly what I'd expect.

Anyway, this is going to require a deeper review from me. And there's still a fair bit of work needed for a proper comparison here. You've hand-coded something outside the `memchr` crate …
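To make the one-shot versus prebuilt distinction concrete, here is a rough sketch using `memchr`'s portable `One` searcher from `memchr::arch::all::memchr` (the actual engines in this repo may be wired differently):

```rust
use memchr::arch::all::memchr::One;

// One-shot: nothing is kept between searches; the convenience function is
// called afresh for every match, starting from the previous offset.
fn count_oneshot(needle: u8, haystack: &[u8]) -> usize {
    let (mut count, mut at) = (0, 0);
    while let Some(i) = memchr::memchr(needle, &haystack[at..]) {
        count += 1;
        at += i + 1;
    }
    count
}

// Prebuilt: the searcher is constructed once, outside the measured calls,
// and only `find` is paid for on each iteration.
fn count_prebuilt(searcher: &One, haystack: &[u8]) -> usize {
    let (mut count, mut at) = (0, 0);
    while let Some(i) = searcher.find(&haystack[at..]) {
        count += 1;
        at += i + 1;
    }
    count
}
```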
(PR title changed from "rust/bufchr/memchr/oneshot benchmark engine" to "rust/bufchr/memchr/prebuilt benchmark engine")
You are right indeed. I have renamed the engine accordingly. I will try to benchmark AVX2 now, and also see whether a pointer-alignment strategy yields better results.
I can try to do so if I can find enough time.
I am trying quick and dirty things on my end related to AVX2 and seem to observe that SSE2 is faster than AVX2. I find this counter-intuitive. Do you happen to know whether pointer alignment matters more for good performance with AVX2 than with SSE2?
I don't. I'm not really an ISA extension expert. I've generally always seen an improvement when using AVX2 compared to SSE2. I do know that on some hardware AVX2 can result in down-clocking the CPU, but I thought that was on very old CPUs. But it might be worth trying other hardware. My understanding is that, at least on x86-64, pointer alignment doesn't really matter for performance. I've never really been able to detect a measurable difference. But I could be wrong.
That's funny: AVX2 makes my iterator consistently slower (I see some literature online about L* cache issues with AVX2 in some cases). Also, you are right, aligning the vector loads seems pointless on my x86_64; it even hurts performance, because the first part of the string must be scanned byte by byte to reach an aligned pointer.

benchmark avx2 aligned rust/bufchr/memchr/prebuilt avx2 unaligned rust/bufchr/memchr/prebuilt sse2 aligned rust/bufchr/memchr/prebuilt sse2 unaligned rust/bufchr/memchr/prebuilt
--------- ---------------------------------------- ------------------------------------------ ---------------------------------------- ------------------------------------------
memchr/sherlock/common/huge1 3.1 GB/s (1.07x) 3.2 GB/s (1.05x) 3.0 GB/s (1.13x) 3.4 GB/s (1.00x)
memchr/sherlock/common/small1 5.5 GB/s (2.09x) 6.1 GB/s (1.89x) 7.5 GB/s (1.52x) 11.5 GB/s (1.00x)
memchr/sherlock/common/tiny1 1462.3 MB/s (1.96x) 2.1 GB/s (1.35x) 2.7 GB/s (1.04x) 2.8 GB/s (1.00x)
memchr/sherlock/never/huge1 40.3 GB/s (1.00x) 33.6 GB/s (1.20x) 37.6 GB/s (1.07x) 37.6 GB/s (1.07x)
memchr/sherlock/never/small1 17.2 GB/s (1.12x) 19.3 GB/s (1.00x) 15.5 GB/s (1.25x) 16.3 GB/s (1.19x)
memchr/sherlock/never/tiny1 2.5 GB/s (1.53x) 3.4 GB/s (1.12x) 3.8 GB/s (1.00x) 3.6 GB/s (1.06x)
memchr/sherlock/never/empty1 16.00ns (1.14x) 16.00ns (1.14x) 14.00ns (1.00x) 15.00ns (1.07x)
memchr/sherlock/rare/huge1 37.9 GB/s (1.00x) 34.5 GB/s (1.10x) 34.1 GB/s (1.11x) 34.5 GB/s (1.10x)
memchr/sherlock/rare/small1 18.2 GB/s (1.06x) 18.2 GB/s (1.06x) 14.4 GB/s (1.34x) 19.3 GB/s (1.00x)
memchr/sherlock/rare/tiny1 2.5 GB/s (1.44x) 3.4 GB/s (1.06x) 3.6 GB/s (1.00x) 3.6 GB/s (1.00x)
memchr/sherlock/uncommon/huge1 10.5 GB/s (1.04x) 10.0 GB/s (1.10x) 10.4 GB/s (1.06x) 11.0 GB/s (1.00x)
memchr/sherlock/uncommon/small1 12.1 GB/s (1.50x) 14.4 GB/s (1.26x) 14.1 GB/s (1.29x) 18.2 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1 1880.1 MB/s (1.75x) 2.8 GB/s (1.15x) 3.2 GB/s (1.00x) 3.2 GB/s (1.00x)
memchr/sherlock/verycommon/huge1 2.2 GB/s (1.07x) 2.2 GB/s (1.07x) 2.1 GB/s (1.11x) 2.4 GB/s (1.00x)
memchr/sherlock/verycommon/small1 3.4 GB/s (2.10x) 3.6 GB/s (1.97x) 4.9 GB/s (1.46x) 7.1 GB/s (1.00x)

Ranking:

Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
sse2 unaligned rust/bufchr/memchr/prebuilt 0.0.1 1.03 15
sse2 aligned rust/bufchr/memchr/prebuilt 0.0.1 1.15 15
avx2 unaligned rust/bufchr/memchr/prebuilt 0.0.1 1.21 15
avx2 aligned rust/bufchr/memchr/prebuilt 0.0.1 1.34 15
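For reference, the "aligned" strategy compared above amounts to something like the following sketch (the general technique on x86_64, not bufchr's actual code): the head of the haystack is scanned byte by byte until the pointer reaches a 16-byte boundary, and only then are aligned loads used.

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn count_matches_aligned(needle: u8, haystack: &[u8]) -> usize {
    use std::arch::x86_64::*;

    let mut count = 0;
    let mut i = 0;
    // Head: advance one byte at a time until the pointer is 16-byte aligned.
    while i < haystack.len() && (haystack.as_ptr().add(i) as usize) % 16 != 0 {
        count += (haystack[i] == needle) as usize;
        i += 1;
    }
    // Body: aligned 16-byte SSE2 loads.
    let vneedle = _mm_set1_epi8(needle as i8);
    while i + 16 <= haystack.len() {
        let chunk = _mm_load_si128(haystack.as_ptr().add(i) as *const __m128i);
        let eq = _mm_cmpeq_epi8(chunk, vneedle);
        count += _mm_movemask_epi8(eq).count_ones() as usize;
        i += 16;
    }
    // Tail: whatever is left after the last full block.
    while i < haystack.len() {
        count += (haystack[i] == needle) as usize;
        i += 1;
    }
    count
}
```

The "unaligned" variant simply skips the head loop and uses `_mm_loadu_si128` throughout, which is what the numbers above suggest is the better trade-off here.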
So I tried different things and here is where I am. I benchmarked the current memchr engine with AVX2 vs. SSE2:

benchmark avx2 sse2
--------- ---- ----
memchr/sherlock/common/huge1 2.1 GB/s (1.00x) 1880.0 MB/s (1.13x)
memchr/sherlock/common/small1 2.2 GB/s (1.24x) 2.7 GB/s (1.00x)
memchr/sherlock/common/tiny1 822.5 MB/s (1.19x) 982.1 MB/s (1.00x)
memchr/sherlock/never/huge1 114.5 GB/s (1.00x) 54.1 GB/s (2.12x)
memchr/sherlock/never/small1 28.1 GB/s (1.00x) 25.8 GB/s (1.09x)
memchr/sherlock/never/tiny1 3.6 GB/s (1.12x) 4.0 GB/s (1.00x)
memchr/sherlock/never/empty1 15.00ns (1.07x) 14.00ns (1.00x)
memchr/sherlock/rare/huge1 94.1 GB/s (1.00x) 48.1 GB/s (1.96x)
memchr/sherlock/rare/small1 22.1 GB/s (1.00x) 20.6 GB/s (1.07x)
memchr/sherlock/rare/tiny1 2.8 GB/s (1.15x) 3.2 GB/s (1.00x)
memchr/sherlock/uncommon/huge1 7.9 GB/s (1.00x) 7.4 GB/s (1.05x)
memchr/sherlock/uncommon/small1 8.7 GB/s (1.06x) 9.2 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1 1370.9 MB/s (1.17x) 1605.0 MB/s (1.00x)
memchr/sherlock/verycommon/huge1 928.8 MB/s (1.32x) 1227.2 MB/s (1.00x)
memchr/sherlock/verycommon/small1 945.1 MB/s (1.34x) 1269.0 MB/s (1.00x)

This seems to indicate that, on my CPU, AVX2 yields some improvement over SSE2. But when using my iterator, I could not find any way to make AVX2 not perform worse than SSE2. I also tried various solutions related to alignment, and nothing performed better than the simple unaligned iterator.

I then tried to optimize my iterator by "thinning" it down and simplifying the inner loop as much as possible, and improved its overall performance a little, notably on cases where matches are dense. It was very tedious to achieve, as a lot of my transitory solutions, while actually simpler and thinner, led to worse performance, and I had quite a hard time pushing the compiler to produce better assembly code.

I will try to benchmark with two and three needles afterwards to get a better overall view of the problem.
For simplicity's sake, I would suggest moving forward with the assumption that AVX2 is faster than SSE2. That should let you focus on the core problem of making the iterator smarter. Then, if we get to a point where the smarter iterator is mergeable, we can double back and take a look at AVX2 versus SSE2.
Here is the benchmark for two needles. The same observations seem to hold:

benchmark rust/bufchr/memchr2/prebuilt rust/memchr/memchr2 rust/memchr/memchr2/fallback rust/memchr/memchr2/naive
--------- ---------------------------- ------------------- ---------------------------- -------------------------
memchr/sherlock/common/huge2 2.2 GB/s (1.00x) 1330.3 MB/s (1.73x) 465.1 MB/s (4.95x) 693.3 MB/s (3.32x)
memchr/sherlock/common/small2 7.6 GB/s (1.00x) 1353.1 MB/s (5.78x) 2042.7 MB/s (3.83x) 1702.3 MB/s (4.59x)
memchr/sherlock/never/huge2 31.0 GB/s (2.12x) 65.9 GB/s (1.00x) 7.3 GB/s (9.07x) 2.1 GB/s (31.42x)
memchr/sherlock/never/small2 13.4 GB/s (1.84x) 24.7 GB/s (1.00x) 5.8 GB/s (4.24x) 1997.6 MB/s (12.68x)
memchr/sherlock/never/tiny2 3.1 GB/s (1.17x) 3.6 GB/s (1.00x) 2.4 GB/s (1.50x) 1241.6 MB/s (2.94x)
memchr/sherlock/never/empty2 14.00ns (1.00x) 16.00ns (1.14x) 20.00ns (1.43x) 15.00ns (1.07x)
memchr/sherlock/rare/huge2 26.5 GB/s (1.87x) 49.6 GB/s (1.00x) 6.9 GB/s (7.17x) 2.1 GB/s (23.94x)
memchr/sherlock/rare/small2 14.4 GB/s (1.16x) 16.7 GB/s (1.00x) 5.7 GB/s (2.92x) 1907.3 MB/s (8.97x)
memchr/sherlock/rare/tiny2 2.9 GB/s (1.00x) 1645.1 MB/s (1.82x) 1994.0 MB/s (1.50x) 1265.5 MB/s (2.36x)
memchr/sherlock/uncommon/huge2 5.7 GB/s (1.00x) 3.5 GB/s (1.61x) 2.0 GB/s (2.84x) 1506.4 MB/s (3.88x)
memchr/sherlock/uncommon/small2 12.1 GB/s (1.00x) 5.8 GB/s (2.10x) 4.0 GB/s (3.06x) 1591.1 MB/s (7.80x)
memchr/sherlock/uncommon/tiny2 2.4 GB/s (1.00x) 1061.3 MB/s (2.30x) 1400.1 MB/s (1.74x) 1196.4 MB/s (2.04x)
I will try working with this assumption, but I will also try a different CPU at some point, because in the current state my iterator using AVX2 is worse than its SSE2 counterpart.

EDIT: I observe the same problem on another CPU (still Intel, though).
Oh, that's interesting. I wish I had time to dig into this with you. I suspect there is something wrong. The fact that …
Ok, so the explanation seems to be that I am a buffoon who managed to improperly enable the avx2 feature, prompting the compiler to fall back to unoptimized SSE2 instructions in the generated code. (I am not entirely sure how this is even possible; I would expect the compiler to refuse to compile, or to produce an executable exhibiting undefined behavior, rather than this, but I don't have enough skill in the matter to know what really happened here.)
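For reference, the usual pattern for AVX2 code in Rust is to compile the hot routine with the target feature enabled and to guard the call behind runtime detection; a minimal sketch (not bufchr's actual code, and the counting loop is only illustrative) looks like this:

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn count_avx2(needle: u8, haystack: &[u8]) -> usize {
    use std::arch::x86_64::*;
    let vneedle = _mm256_set1_epi8(needle as i8);
    let mut count = 0;
    let mut i = 0;
    while i + 32 <= haystack.len() {
        let chunk = _mm256_loadu_si256(haystack.as_ptr().add(i) as *const __m256i);
        let eq = _mm256_cmpeq_epi8(chunk, vneedle);
        count += _mm256_movemask_epi8(eq).count_ones() as usize;
        i += 32;
    }
    // Finish the tail without vector instructions.
    count + haystack[i..].iter().filter(|&&b| b == needle).count()
}

#[cfg(target_arch = "x86_64")]
fn count(needle: u8, haystack: &[u8]) -> usize {
    if is_x86_feature_detected!("avx2") {
        // SAFETY: AVX2 availability was just checked at runtime.
        unsafe { count_avx2(needle, haystack) }
    } else {
        haystack.iter().filter(|&&b| b == needle).count()
    }
}
```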
Now that I have avx2 instructions properly used, here is what I observe:

benchmark avx2 bufchr/memchr/prebuilt avx2-aligned bufchr/memchr/prebuilt memchr/memchr/prebuilt sse2 bufchr/memchr/prebuilt
--------- --------------------------- ----------------------------------- ---------------------- ---------------------------
memchr/sherlock/common/huge1 4.2 GB/s (1.00x) 4.1 GB/s (1.02x) 2.2 GB/s (1.90x) 3.4 GB/s (1.24x)
memchr/sherlock/common/small1 11.5 GB/s (1.04x) 8.4 GB/s (1.42x) 2.4 GB/s (4.87x) 11.9 GB/s (1.00x)
memchr/sherlock/common/tiny1 2.8 GB/s (1.10x) 1827.9 MB/s (1.71x) 901.4 MB/s (3.48x) 3.1 GB/s (1.00x)
memchr/sherlock/never/huge1 41.4 GB/s (2.84x) 45.2 GB/s (2.60x) 117.6 GB/s (1.00x) 37.6 GB/s (3.13x)
memchr/sherlock/never/small1 19.9 GB/s (1.48x) 20.6 GB/s (1.43x) 29.4 GB/s (1.00x) 15.9 GB/s (1.86x)
memchr/sherlock/never/tiny1 3.8 GB/s (1.00x) 2.6 GB/s (1.47x) 3.6 GB/s (1.06x) 3.8 GB/s (1.00x)
memchr/sherlock/never/empty1 14.00ns (1.00x) 15.00ns (1.07x) 16.00ns (1.14x) 14.00ns (1.00x)
memchr/sherlock/rare/huge1 37.1 GB/s (2.57x) 41.5 GB/s (2.30x) 95.5 GB/s (1.00x) 34.1 GB/s (2.80x)
memchr/sherlock/rare/small1 21.3 GB/s (1.12x) 18.2 GB/s (1.31x) 23.8 GB/s (1.00x) 17.7 GB/s (1.35x)
memchr/sherlock/rare/tiny1 3.8 GB/s (1.00x) 2.6 GB/s (1.47x) 2.9 GB/s (1.29x) 3.8 GB/s (1.00x)
memchr/sherlock/uncommon/huge1 14.0 GB/s (1.00x) 11.6 GB/s (1.20x) 7.4 GB/s (1.90x) 11.4 GB/s (1.23x)
memchr/sherlock/uncommon/small1 18.7 GB/s (1.00x) 14.7 GB/s (1.27x) 9.5 GB/s (1.97x) 16.7 GB/s (1.12x)
memchr/sherlock/uncommon/tiny1 3.6 GB/s (1.00x) 2.5 GB/s (1.44x) 1495.5 MB/s (2.44x) 3.6 GB/s (1.00x)
memchr/sherlock/verycommon/huge1 3.3 GB/s (1.00x) 2.6 GB/s (1.26x) 1027.4 MB/s (3.32x) 2.5 GB/s (1.31x)
memchr/sherlock/verycommon/small1 8.0 GB/s (1.07x) 5.2 GB/s (1.64x) 1064.3 MB/s (8.26x) 8.6 GB/s (1.00x)

Ranking:

Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
avx2 bufchr/memchr/prebuilt 0.0.1 1.20 15
sse2 bufchr/memchr/prebuilt 0.0.1 1.30 15
avx2-aligned bufchr/memchr/prebuilt 0.0.1 1.46 15
memchr/memchr/prebuilt 2.7.4 1.88 15

So now avx2 is giving an edge, but not by much. Alignment seems to help a little bit with large inputs where the needle is rare, but not that much either.

Also, when searching for 2 needles (using avx2 properly this time), the discrepancy drops:

benchmark rust/bufchr/memchr2/prebuilt rust/memchr/memchr2
--------- ---------------------------- -------------------
memchr/sherlock/common/huge2 3.1 GB/s (1.00x) 1337.4 MB/s (2.34x)
memchr/sherlock/common/small2 9.4 GB/s (1.00x) 1490.0 MB/s (6.44x)
memchr/sherlock/never/huge2 51.9 GB/s (1.27x) 65.9 GB/s (1.00x)
memchr/sherlock/never/small2 19.3 GB/s (1.10x) 21.3 GB/s (1.00x)
memchr/sherlock/never/tiny2 3.6 GB/s (1.00x) 3.6 GB/s (1.00x)
memchr/sherlock/never/empty2 14.00ns (1.00x) 16.00ns (1.14x)
memchr/sherlock/rare/huge2 42.4 GB/s (1.17x) 49.6 GB/s (1.00x)
memchr/sherlock/rare/small2 18.2 GB/s (1.00x) 15.1 GB/s (1.21x)
memchr/sherlock/rare/tiny2 3.1 GB/s (1.00x) 2.2 GB/s (1.38x)
memchr/sherlock/uncommon/huge2 5.8 GB/s (1.00x) 3.5 GB/s (1.63x)
memchr/sherlock/uncommon/small2 15.9 GB/s (1.00x) 5.8 GB/s (2.72x)
memchr/sherlock/uncommon/tiny2 2.9 GB/s (1.00x) 1061.3 MB/s (2.82x)

Ranking:

Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
rust/bufchr/memchr2/prebuilt 0.0.1 1.04 12
rust/memchr/memchr2 2.7.4 1.63 12

I'll try and check the 3-needle case afterwards, because this is the use case I am particularly interested in. Then I will try integrating a proper "within-memchr" crate solution in my fork, for an apples-to-apples benchmark with rigorous testing, but after the holidays.
Results from the 3-needle case:

benchmark rust/bufchr/memchr3/prebuilt rust/memchr/memchr3
--------- ---------------------------- -------------------
memchr/sherlock/common/huge3 2.4 GB/s (1.00x) 861.2 MB/s (2.91x)
memchr/sherlock/common/small3 6.6 GB/s (1.00x) 969.7 MB/s (7.02x)
memchr/sherlock/never/huge3 28.8 GB/s (1.93x) 55.6 GB/s (1.00x)
memchr/sherlock/never/small3 15.5 GB/s (1.43x) 22.1 GB/s (1.00x)
memchr/sherlock/never/tiny3 3.4 GB/s (1.00x) 3.4 GB/s (1.00x)
memchr/sherlock/never/empty3 14.00ns (1.00x) 16.00ns (1.14x)
memchr/sherlock/rare/huge3 27.2 GB/s (1.40x) 38.1 GB/s (1.00x)
memchr/sherlock/rare/small3 12.6 GB/s (1.07x) 13.4 GB/s (1.00x)
memchr/sherlock/rare/tiny3 3.2 GB/s (1.00x) 1935.4 MB/s (1.70x)
memchr/sherlock/uncommon/huge3 4.9 GB/s (1.00x) 2.6 GB/s (1.87x)
memchr/sherlock/uncommon/small3 12.9 GB/s (1.00x) 3.0 GB/s (4.25x)
memchr/sherlock/uncommon/tiny3 2.5 GB/s (1.00x) 626.7 MB/s (4.04x)

Here, unrolling seems to give an edge once more on huge inputs where the needle is rare or never present. And here is the related ranking:

Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
rust/bufchr/memchr3/prebuilt 0.0.1 1.13 12
rust/memchr/memchr3 2.7.4 1.81 12
Hello @BurntSushi, I have tried to integrate the new iterator logic within `memchr` itself.

Given that the use cases where this new iterator is supposed to shine are not that common (at least I only have one case in mind currently, namely CSV parsing, which needs to search for 3 different bytes repeatedly), I am wondering whether this is a good fit for the crate.

In the meantime, I have a not-runtime-detected version of the 3-byte searcher over here. I am using it in a custom Rust CSV parser relying on SIMD instructions (the crate is not yet documented but can be found here), as hinted at in this other discussion. I am not using the …
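For the CSV use case mentioned above, the existing three-needle iterator in `memchr` looks like this (the specific needles are only an illustration of the "3 different bytes" scenario):

```rust
// Find every quote, comma and newline in a CSV buffer in a single pass.
fn structural_positions(buf: &[u8]) -> Vec<usize> {
    memchr::memchr3_iter(b'"', b',', b'\n', buf).collect()
}

fn main() {
    let row = b"name,\"last, first\"\nother,value\n";
    println!("{:?}", structural_positions(row));
}
```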
Thank you for that investigation! I do think it would be great to get this kind of optimization into this crate somehow, but it does seem like quite the challenge unfortunately.
Thinking a little bit more about this, the only practical paradigm I can think of to implement this iteration logic, all while performing runtime detection of CPU features, is …
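One common shape for that kind of design — given here as an assumption, not necessarily the paradigm alluded to above — is to run the CPU-feature check once when the iterator is built, remember the choice, and branch on it per block inside `next()`:

```rust
#[cfg(target_arch = "x86_64")]
mod sketch {
    #[derive(Clone, Copy)]
    enum Kind {
        Avx2,
        Sse2,
    }

    pub struct Matches<'h> {
        haystack: &'h [u8],
        needle: u8,
        pos: usize,
        kind: Kind,
    }

    impl<'h> Matches<'h> {
        pub fn new(needle: u8, haystack: &'h [u8]) -> Matches<'h> {
            // The runtime feature check happens exactly once, here.
            let kind = if is_x86_feature_detected!("avx2") {
                Kind::Avx2
            } else {
                Kind::Sse2
            };
            Matches { haystack, needle, pos: 0, kind }
        }
    }

    impl<'h> Iterator for Matches<'h> {
        type Item = usize;

        fn next(&mut self) -> Option<usize> {
            while self.pos < self.haystack.len() {
                let i = self.pos;
                self.pos += 1;
                // Stand-in scalar scan: a real implementation would dispatch on
                // `self.kind` here and run an AVX2 or SSE2 block routine instead.
                let _ = self.kind;
                if self.haystack[i] == self.needle {
                    return Some(i);
                }
            }
            None
        }
    }
}
```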
Hello @BurntSushi, here is a dummy PR meant as a conversation opener regarding the inclusion of an iterator for `memchr` routines with different amortization principles (as previously discussed in #184). The idea is to have an iterator for cases when people actually know they will need to scan the whole haystack for matches and want to amortize the vectorized calls along the way, because some matches can sometimes be very close to one another.

Several notes:

- … the `self.start == self.current` test, but I am not sure.

Please feel free to tell me to let the matter rest if you want, I don't want to be a bother.
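To make the amortization idea concrete, here is a rough sketch of the kind of inner loop being described (an assumed illustration, not the actual bufchr code): a single SSE2 comparison yields a 16-bit mask, and every match in that block is drained from the mask before the next load, so densely packed matches do not each pay for a fresh vectorized search.

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn for_each_match(needle: u8, haystack: &[u8], mut f: impl FnMut(usize)) {
    use std::arch::x86_64::*;
    let vneedle = _mm_set1_epi8(needle as i8);
    let mut i = 0;
    while i + 16 <= haystack.len() {
        let chunk = _mm_loadu_si128(haystack.as_ptr().add(i) as *const __m128i);
        let mut mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, vneedle)) as u32;
        // All matches in this 16-byte block come out of the single comparison above.
        while mask != 0 {
            f(i + mask.trailing_zeros() as usize);
            mask &= mask - 1; // clear the lowest set bit
        }
        i += 16;
    }
    // Scalar tail for the final partial block.
    for (j, &b) in haystack[i..].iter().enumerate() {
        if b == needle {
            f(i + j);
        }
    }
}
```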