
@Yomguithereal

Hello @BurntSushi, here is a dummy PR meant as a conversation opener regarding the inclusion of an iterator for memchr routines with different amortization principles (as previously discussed in #184). The idea is to have an iterator for cases where people actually know they will need to scan the whole haystack for matches and want to amortize the vectorized calls along the way, because matches can sometimes be very close to one another.

Several notes:

  • I am not sure the boolean tracking whether the first unaligned load was performed is actually required; it could perhaps be inferred from a self.start == self.current test, but I am not certain.
  • The trick about the last unaligned load is not possible in this iterator's context because its invariant does not hold: we may stumble upon a match that was already reported (we could store an offset, but is it worth the cost?).
  • The iterator does not perform any unrolling because it would probably mean storing more than one mask at a time.

Please feel free to tell me to let the matter rest if you want; I don't want to be a bother.
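To make the amortization idea concrete, here is a minimal scalar sketch (all names are hypothetical, and the vectorized compare is replaced by a per-byte loop): the iterator computes one match mask per 16-byte block and then drains that mask bit by bit, instead of restarting a fresh search after every match.

```rust
/// Hypothetical sketch of the proposed iterator: one mask per block,
/// drained lazily. A real implementation would build `mask` with a SIMD
/// compare + movemask instead of the scalar loop below.
struct MaskIter<'a> {
    haystack: &'a [u8],
    needle: u8,
    pos: usize,      // start of the next block to examine
    mask: u32,       // pending match bits for the block at `mask_pos`
    mask_pos: usize, // offset of the block the current mask covers
}

impl<'a> MaskIter<'a> {
    fn new(haystack: &'a [u8], needle: u8) -> Self {
        MaskIter { haystack, needle, pos: 0, mask: 0, mask_pos: 0 }
    }
}

impl<'a> Iterator for MaskIter<'a> {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        loop {
            if self.mask != 0 {
                // Amortized path: several consecutive matches are served
                // from the same mask without touching the haystack again.
                let bit = self.mask.trailing_zeros() as usize;
                self.mask &= self.mask - 1; // clear lowest set bit
                return Some(self.mask_pos + bit);
            }
            if self.pos >= self.haystack.len() {
                return None;
            }
            // Compute the match mask for the next block (up to 16 bytes).
            let end = (self.pos + 16).min(self.haystack.len());
            let mut m = 0u32;
            for (i, &b) in self.haystack[self.pos..end].iter().enumerate() {
                if b == self.needle {
                    m |= 1 << i;
                }
            }
            self.mask_pos = self.pos;
            self.pos = end;
            self.mask = m;
        }
    }
}
```

With dense matches, most next() calls hit the cheap mask-draining branch, so the load-and-compare work is paid once per block rather than once per match.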

@BurntSushi
Owner

I think before I'm willing to expand the API and make it more complicated, there needs to be more benchmarking:

  1. The status quo
  2. Changing the current iterator to support this sort of amortization
  3. Providing both options

This would give us concrete numbers to weigh against the API complexity of more iterators.

@Yomguithereal
Author

Hello @BurntSushi, I have started an early benchmark over there to address the question: https://github.com/Yomguithereal/memchr-iter-bench

I am new to the "benchmarking in Rust" game, so I used criterion; I hope this is a sensible choice. The results are here (sorry, the formatting is a bit silly; criterion-table was the only thing I found to quickly render the results as a table).

I benchmarked a naive loop of memchr, an amortized loop of memchr (where the splat is only created once), memchr_iter, and the proposed iterator (fixed; lol, the PR code is wildly incorrect since I was unable to test it), which I will refer to as memchr_memoized below, all against a linear scalar baseline. I benchmarked all those options on strings ranging from short to long, also varying the density of matches found within them.

I have only benchmarked SSE2 instructions so far, because I felt they would yield the lowest performance compared to other SIMD implementations, and the 16-byte step should, in principle, hurt memchr_memoized the most.

I have also only benchmarked the case where you are searching for a single needle.

Also, the use case here is to report all matches, not only the first one.

Some things I seem to observe:

  • Amortizing a single splat vector creation seems to be pointless
  • memchr_iter is usually a tiny bit faster than the naive or amortized memchr loop, as you hinted at
  • memchr_memoized is usually competitive when strings are long and when they are expected to contain multiple needles, the denser the better.
  • In the degenerate case where the haystack contains only needles (or nearly only needles), memchr_memoized does a better job than memchr etc., of course, but is still worse than the scalar solution, which makes sense

It should also be noted that, for now, memchr_memoized does not attempt aligned loads at all, notably because the iterator logic and the requirement to report all matches make the clever unaligned tricks awkward: overlapping loads must be avoided, or the same match would be reported multiple times. There are of course ways around that, in order to use aligned loads, but I have not worked on this yet.
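For the record, one possible (untested, hypothetical) way around the duplicate reports: when the final unaligned load overlaps the previous block, clear the mask bits that fall inside the overlap before draining it, assuming the overlap is smaller than the vector width.

```rust
/// Hypothetical helper: `mask` is the match mask of a final unaligned load
/// that overlaps the previous block by `overlap` bytes (0 <= overlap < 32).
/// Clearing the low `overlap` bits drops positions that were already
/// reported from the previous block.
fn dedup_overlap(mask: u32, overlap: u32) -> u32 {
    debug_assert!(overlap < 32);
    mask & !((1u32 << overlap) - 1)
}
```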

So I think memchr_memoized has good value for a specific use-case but still represents some kind of trade-off based on what you need to do, which means it is probably not a good fit as a replacement for memchr_iter, right?

@BurntSushi
Owner

The benchmarks should use rebar. There's already a huge pile of benchmarks in this repo that cover a broad sampling of use cases.

@Yomguithereal Yomguithereal changed the title arch::generic::memchr::OneMatches iterator proof-of-concept Adding rust/bufchr/memchr/oneshot benchmark engine Jul 28, 2025
@Yomguithereal
Author

I have pivoted the PR to be adding a benchmark engine to the repo using rebar as you requested. The engine name is rust/bufchr/memchr/oneshot (I went with bufchr as a reference to #88).

I built using the following command:

rebar build -e '^rust/(memchr|bufchr)/memchr/(oneshot|prebuilt|onlycount|fallback|naive)$'

So I ran the benchmarks for the following engines:

rust/bufchr/memchr/oneshot
rust/memchr/memchr/fallback
rust/memchr/memchr/naive
rust/memchr/memchr/oneshot
rust/memchr/memchr/onlycount
rust/memchr/memchr/prebuilt

Using the following command:

rebar measure -e '^rust/(memchr|bufchr)/memchr/(oneshot|prebuilt|onlycount|fallback|naive)$' | tee bench.csv

I got the following results: bench.csv

rebar cmp gives me the following:

benchmark                          rust/bufchr/memchr/oneshot  rust/memchr/memchr/fallback  rust/memchr/memchr/naive  rust/memchr/memchr/oneshot  rust/memchr/memchr/onlycount  rust/memchr/memchr/prebuilt
---------                          --------------------------  ---------------------------  ------------------------  --------------------------  ----------------------------  ---------------------------
memchr/sherlock/common/huge1       3.3 GB/s (11.32x)           940.1 MB/s (41.09x)          1137.0 MB/s (33.97x)      2.1 GB/s (18.06x)           37.7 GB/s (1.00x)             2.2 GB/s (17.04x)
memchr/sherlock/common/small1      11.0 GB/s (1.33x)           3.0 GB/s (4.83x)             2.3 GB/s (6.31x)          2.3 GB/s (6.50x)            14.7 GB/s (1.00x)             2.4 GB/s (6.10x)
memchr/sherlock/common/tiny1       2.8 GB/s (1.00x)            1370.9 MB/s (2.09x)          1605.0 MB/s (1.78x)       671.5 MB/s (4.26x)          2.6 GB/s (1.09x)              901.4 MB/s (3.17x)
memchr/sherlock/never/huge1        37.6 GB/s (3.13x)           15.0 GB/s (7.85x)            2.2 GB/s (52.59x)         117.6 GB/s (1.00x)          37.7 GB/s (3.12x)             117.6 GB/s (1.00x)
memchr/sherlock/never/small1       18.7 GB/s (1.57x)           10.0 GB/s (2.95x)            2.1 GB/s (14.10x)         29.4 GB/s (1.00x)           14.7 GB/s (2.00x)             29.4 GB/s (1.00x)
memchr/sherlock/never/tiny1        3.8 GB/s (1.00x)            2.7 GB/s (1.41x)             1370.9 MB/s (2.82x)       3.8 GB/s (1.00x)            2.6 GB/s (1.47x)              3.6 GB/s (1.06x)
memchr/sherlock/never/empty1       14.00ns (1.00x)             15.00ns (1.07x)              14.00ns (1.00x)           16.00ns (1.14x)             15.00ns (1.07x)               16.00ns (1.14x)
memchr/sherlock/rare/huge1         34.2 GB/s (2.79x)           14.4 GB/s (6.63x)            2.2 GB/s (42.87x)         93.6 GB/s (1.02x)           37.7 GB/s (2.53x)             95.5 GB/s (1.00x)
memchr/sherlock/rare/small1        19.3 GB/s (1.14x)           9.1 GB/s (2.43x)             2.0 GB/s (10.79x)         22.1 GB/s (1.00x)           14.7 GB/s (1.50x)             20.6 GB/s (1.07x)
memchr/sherlock/rare/tiny1         3.6 GB/s (1.00x)            2.7 GB/s (1.33x)             1370.9 MB/s (2.67x)       2.9 GB/s (1.22x)            2.6 GB/s (1.39x)              2.9 GB/s (1.22x)
memchr/sherlock/uncommon/huge1     11.1 GB/s (3.41x)           4.7 GB/s (7.99x)             1938.9 MB/s (19.92x)      7.1 GB/s (5.29x)            37.7 GB/s (1.00x)             7.4 GB/s (5.13x)
memchr/sherlock/uncommon/small1    17.7 GB/s (1.00x)           7.5 GB/s (2.34x)             2.1 GB/s (8.49x)          9.0 GB/s (1.97x)            14.7 GB/s (1.20x)             9.5 GB/s (1.86x)
memchr/sherlock/uncommon/tiny1     3.2 GB/s (1.00x)            1994.0 MB/s (1.65x)          1687.3 MB/s (1.95x)       1265.5 MB/s (2.60x)         2.6 GB/s (1.25x)              1495.5 MB/s (2.20x)
memchr/sherlock/verycommon/huge1   2.4 GB/s (15.94x)           540.4 MB/s (71.48x)          561.8 MB/s (68.75x)       946.9 MB/s (40.79x)         37.7 GB/s (1.00x)             1028.3 MB/s (37.56x)
memchr/sherlock/verycommon/small1  7.1 GB/s (2.07x)            1688.6 MB/s (8.93x)          1840.8 MB/s (8.19x)       980.2 MB/s (15.38x)         14.7 GB/s (1.00x)             1064.3 MB/s (14.17x)

And rebar rank gives me the following:

Engine                        Version  Geometric mean of speed ratios  Benchmark count
------                        -------  ------------------------------  ---------------
rust/memchr/memchr/onlycount  2.7.4    1.34                            15
rust/bufchr/memchr/oneshot    0.0.1    1.97                            15
rust/memchr/memchr/prebuilt   2.7.4    2.83                            15
rust/memchr/memchr/oneshot    2.7.4    2.97                            15
rust/memchr/memchr/fallback   2.7.4    4.49                            15
rust/memchr/memchr/naive      2.7.4    8.93                            15

I made sure to check that the benchmarked functions returned the same count of matches as found in the definition files (I need to check whether there is an integrated way to do so using rebar measure --test).

I am only using SSE2 instructions, whereas the rust/memchr engine was supposedly using AVX2. Here is what lscpu gives me, up to and including the flags:

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  Model name:             11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
    CPU family:           6
    Model:                140
    Thread(s) per core:   2
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             1
    CPU max MHz:          4200.0000
    CPU min MHz:          400.0000
    BogoMIPS:             4838.40
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi 
                          mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon
                           pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulq
                          dq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2a
                          pic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch c
                          puid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority e
                          pt vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512d
                          q rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl x
                          saveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk dtherm ida arat pln pts hwp hw
                          p_notify hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke avx512_vbmi2 g
                          fni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fs
                          rm avx512_vp2intersect md_clear ibt flush_l1d arch_capabilities

@BurntSushi
Owner

Nice work. Note that what you have isn't a one-shot searcher. It is prebuilt. One-shot means the search is completely reconstructed on every call.

Your benchmark results look nice. The biggest wins seem to come from benchmarks with lots of matches, which is exactly what I'd expect.

Anyway, this is going to require a deeper review from me. And there's still a fair bit of work needed for a proper comparison here. You've hand-coded something outside the memchr crate, but there's infrastructure that's missing (particularly around CPU feature detection). Ideally what the comparison would look like is the status quo versus a memchr crate whose memchr iterators work like yours. i.e., No API differences. That would be the best apples-to-apples comparison.

@Yomguithereal Yomguithereal changed the title Adding rust/bufchr/memchr/oneshot benchmark engine Adding rust/bufchr/memchr/prebuilt benchmark engine Jul 28, 2025
@Yomguithereal
Author

Nice work. Note that what you have isn't a one-shot searcher. It is prebuilt. One-shot means the search is completely reconstructed on every call.

You are right indeed. I have renamed the engine accordingly.

I will now try to benchmark AVX2, and also see whether some pointer-aligning strategy yields better results.

Anyway, this is going to require a deeper review from me. And there's still a fair bit of work needed for a proper comparison here. You've hand-coded something outside the memchr crate, but there's infrastructure that's missing (particularly around CPU feature detection). Ideally what the comparison would look like is the status quo versus a memchr crate whose memchr iterators work like yours. i.e., No API differences. That would be the best apples-to-apples comparison.

I can try to do so if I can find enough time.

@Yomguithereal
Author

I am trying quick and dirty things on my end related to AVX2 and seem to observe that SSE2 is faster than AVX2. I find this counter-intuitive. Do you happen to know if pointer alignment is more important to get good performance with AVX2 vs. SSE2?

@BurntSushi
Owner

I don't. I'm not really an ISA extension expert. I've generally always seen an improvement when using AVX2 compared to SSE2.

I do know that on some hardware AVX2 can result in down-clocking the CPU, but I thought that was on very old CPUs. But it might be worth trying other hardware.

My understanding is that, at least on x86-64, pointer alignment doesn't really matter for performance. I've never really been able to detect a measurable difference. But I could be wrong.

@Yomguithereal
Author

Yomguithereal commented Jul 28, 2025

That's funny: AVX2 makes my iterator consistently slower (I see some literature online related to L* cache issues with AVX2 in some cases). Also, you are right, aligning loads seems pointless on my x86_64; it even hurts performance, because the first part of the string must be scanned linearly to align the pointer.

benchmark                          avx2 aligned rust/bufchr/memchr/prebuilt  avx2 unaligned rust/bufchr/memchr/prebuilt  sse2 aligned rust/bufchr/memchr/prebuilt  sse2 unaligned rust/bufchr/memchr/prebuilt
---------                          ----------------------------------------  ------------------------------------------  ----------------------------------------  ------------------------------------------
memchr/sherlock/common/huge1       3.1 GB/s (1.07x)                          3.2 GB/s (1.05x)                            3.0 GB/s (1.13x)                          3.4 GB/s (1.00x)
memchr/sherlock/common/small1      5.5 GB/s (2.09x)                          6.1 GB/s (1.89x)                            7.5 GB/s (1.52x)                          11.5 GB/s (1.00x)
memchr/sherlock/common/tiny1       1462.3 MB/s (1.96x)                       2.1 GB/s (1.35x)                            2.7 GB/s (1.04x)                          2.8 GB/s (1.00x)
memchr/sherlock/never/huge1        40.3 GB/s (1.00x)                         33.6 GB/s (1.20x)                           37.6 GB/s (1.07x)                         37.6 GB/s (1.07x)
memchr/sherlock/never/small1       17.2 GB/s (1.12x)                         19.3 GB/s (1.00x)                           15.5 GB/s (1.25x)                         16.3 GB/s (1.19x)
memchr/sherlock/never/tiny1        2.5 GB/s (1.53x)                          3.4 GB/s (1.12x)                            3.8 GB/s (1.00x)                          3.6 GB/s (1.06x)
memchr/sherlock/never/empty1       16.00ns (1.14x)                           16.00ns (1.14x)                             14.00ns (1.00x)                           15.00ns (1.07x)
memchr/sherlock/rare/huge1         37.9 GB/s (1.00x)                         34.5 GB/s (1.10x)                           34.1 GB/s (1.11x)                         34.5 GB/s (1.10x)
memchr/sherlock/rare/small1        18.2 GB/s (1.06x)                         18.2 GB/s (1.06x)                           14.4 GB/s (1.34x)                         19.3 GB/s (1.00x)
memchr/sherlock/rare/tiny1         2.5 GB/s (1.44x)                          3.4 GB/s (1.06x)                            3.6 GB/s (1.00x)                          3.6 GB/s (1.00x)
memchr/sherlock/uncommon/huge1     10.5 GB/s (1.04x)                         10.0 GB/s (1.10x)                           10.4 GB/s (1.06x)                         11.0 GB/s (1.00x)
memchr/sherlock/uncommon/small1    12.1 GB/s (1.50x)                         14.4 GB/s (1.26x)                           14.1 GB/s (1.29x)                         18.2 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1     1880.1 MB/s (1.75x)                       2.8 GB/s (1.15x)                            3.2 GB/s (1.00x)                          3.2 GB/s (1.00x)
memchr/sherlock/verycommon/huge1   2.2 GB/s (1.07x)                          2.2 GB/s (1.07x)                            2.1 GB/s (1.11x)                          2.4 GB/s (1.00x)
memchr/sherlock/verycommon/small1  3.4 GB/s (2.10x)                          3.6 GB/s (1.97x)                            4.9 GB/s (1.46x)                          7.1 GB/s (1.00x)

Engine                                      Version  Geometric mean of speed ratios  Benchmark count
------                                      -------  ------------------------------  ---------------
sse2 unaligned rust/bufchr/memchr/prebuilt  0.0.1    1.03                            15
sse2 aligned rust/bufchr/memchr/prebuilt    0.0.1    1.15                            15
avx2 unaligned rust/bufchr/memchr/prebuilt  0.0.1    1.21                            15
avx2 aligned rust/bufchr/memchr/prebuilt    0.0.1    1.34                            15

@Yomguithereal
Author

Yomguithereal commented Jul 29, 2025

So I tried different things and here is where I am:

I benchmarked current memchr using SSE2 vs AVX2 and found results matching intuition:

benchmark                          avx2                 sse2
---------                          ----                 ----
memchr/sherlock/common/huge1       2.1 GB/s (1.00x)     1880.0 MB/s (1.13x)
memchr/sherlock/common/small1      2.2 GB/s (1.24x)     2.7 GB/s (1.00x)
memchr/sherlock/common/tiny1       822.5 MB/s (1.19x)   982.1 MB/s (1.00x)
memchr/sherlock/never/huge1        114.5 GB/s (1.00x)   54.1 GB/s (2.12x)
memchr/sherlock/never/small1       28.1 GB/s (1.00x)    25.8 GB/s (1.09x)
memchr/sherlock/never/tiny1        3.6 GB/s (1.12x)     4.0 GB/s (1.00x)
memchr/sherlock/never/empty1       15.00ns (1.07x)      14.00ns (1.00x)
memchr/sherlock/rare/huge1         94.1 GB/s (1.00x)    48.1 GB/s (1.96x)
memchr/sherlock/rare/small1        22.1 GB/s (1.00x)    20.6 GB/s (1.07x)
memchr/sherlock/rare/tiny1         2.8 GB/s (1.15x)     3.2 GB/s (1.00x)
memchr/sherlock/uncommon/huge1     7.9 GB/s (1.00x)     7.4 GB/s (1.05x)
memchr/sherlock/uncommon/small1    8.7 GB/s (1.06x)     9.2 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1     1370.9 MB/s (1.17x)  1605.0 MB/s (1.00x)
memchr/sherlock/verycommon/huge1   928.8 MB/s (1.32x)   1227.2 MB/s (1.00x)
memchr/sherlock/verycommon/small1  945.1 MB/s (1.34x)   1269.0 MB/s (1.00x)

This seems to indicate that, on my CPU, AVX2 yields some improvement over SSE2. But when using my iterator, I could not find any way to keep AVX2 from performing worse than SSE2.

I also tried various solutions related to alignment, and nothing performed better than the simple unaligned iterator.

I then tried to optimize my iterator by "thinning" it down and simplifying the inner loop as much as possible, and improved its overall performance a little, notably in cases where matches are dense. This was very tedious to achieve, as many of my transitory solutions, while actually simpler and thinner, led to worse performance, and I had quite a hard time pushing the compiler to produce better assembly code.

I will try benchmarking with two and three needles afterwards, to get a more rounded view of the problem.

@BurntSushi
Owner

For simplicity's sake, I would suggest moving forward with the assumption that AVX2 is faster than SSE2. That should let you focus on the core problem of making the iterator smarter. Then, if we get to a point where the smarter iterator is mergeable, we can double back and take a look at AVX2 versus SSE2.

@Yomguithereal
Author

Yomguithereal commented Jul 29, 2025

Here is the benchmark for two needles. Same observations seem to hold:

benchmark                        rust/bufchr/memchr2/prebuilt  rust/memchr/memchr2  rust/memchr/memchr2/fallback  rust/memchr/memchr2/naive
---------                        ----------------------------  -------------------  ----------------------------  -------------------------
memchr/sherlock/common/huge2     2.2 GB/s (1.00x)              1330.3 MB/s (1.73x)  465.1 MB/s (4.95x)            693.3 MB/s (3.32x)
memchr/sherlock/common/small2    7.6 GB/s (1.00x)              1353.1 MB/s (5.78x)  2042.7 MB/s (3.83x)           1702.3 MB/s (4.59x)
memchr/sherlock/never/huge2      31.0 GB/s (2.12x)             65.9 GB/s (1.00x)    7.3 GB/s (9.07x)              2.1 GB/s (31.42x)
memchr/sherlock/never/small2     13.4 GB/s (1.84x)             24.7 GB/s (1.00x)    5.8 GB/s (4.24x)              1997.6 MB/s (12.68x)
memchr/sherlock/never/tiny2      3.1 GB/s (1.17x)              3.6 GB/s (1.00x)     2.4 GB/s (1.50x)              1241.6 MB/s (2.94x)
memchr/sherlock/never/empty2     14.00ns (1.00x)               16.00ns (1.14x)      20.00ns (1.43x)               15.00ns (1.07x)
memchr/sherlock/rare/huge2       26.5 GB/s (1.87x)             49.6 GB/s (1.00x)    6.9 GB/s (7.17x)              2.1 GB/s (23.94x)
memchr/sherlock/rare/small2      14.4 GB/s (1.16x)             16.7 GB/s (1.00x)    5.7 GB/s (2.92x)              1907.3 MB/s (8.97x)
memchr/sherlock/rare/tiny2       2.9 GB/s (1.00x)              1645.1 MB/s (1.82x)  1994.0 MB/s (1.50x)           1265.5 MB/s (2.36x)
memchr/sherlock/uncommon/huge2   5.7 GB/s (1.00x)              3.5 GB/s (1.61x)     2.0 GB/s (2.84x)              1506.4 MB/s (3.88x)
memchr/sherlock/uncommon/small2  12.1 GB/s (1.00x)             5.8 GB/s (2.10x)     4.0 GB/s (3.06x)              1591.1 MB/s (7.80x)
memchr/sherlock/uncommon/tiny2   2.4 GB/s (1.00x)              1061.3 MB/s (2.30x)  1400.1 MB/s (1.74x)           1196.4 MB/s (2.04x)

For simplicity's sake, I would suggest moving forward with the assumption that AVX2 is faster than SSE2.

I will try working with this assumption, but I will also try a different CPU at some point, because in the current state my iterator using AVX2 is worse than memchr_iter. In fact, this seems to be the ordering: my iterator with SSE2 > memchr_iter > my iterator with AVX2.

EDIT: I observe the same problem on another CPU (still Intel though).

@BurntSushi
Owner

Oh that's interesting. I wish I had time to dig into this with you. I suspect there is something wrong. The fact that memchr/sherlock/never/huge2 is twice as slow with bufchr suggests something is pretty wrong. That benchmark shouldn't rely at all on the iterator optimization, since there are never any matches. If I were working on this, I'd look toward carefully scrutinizing the codegen.

@Yomguithereal
Author

Ok, so the explanation seems to be that I am a buffoon who succeeded in improperly enabling the avx2 feature, prompting the compiler to fall back to unoptimized SSE2 instructions in the generated code (I am not entirely sure how this is even possible; I would expect the compiler to refuse to compile, or to produce an executable exhibiting UB, rather than this, but I don't have enough skill in the matter to know what really happened here).
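For reference, a hedged sketch of the usual pattern for wiring up AVX2 correctly (function names are hypothetical and the bodies are scalar stand-ins): the #[target_feature] attribute tells the compiler it may emit AVX2 instructions inside that one function, and the runtime check makes calling it sound. If the attribute is missing, the compiler silently keeps the baseline target features (SSE2 on x86-64) rather than failing to compile, which matches the fallback behavior observed above.

```rust
#[cfg(target_arch = "x86_64")]
fn count_matches(haystack: &[u8], needle: u8) -> usize {
    if is_x86_feature_detected!("avx2") {
        // SAFETY: AVX2 support was just verified at runtime.
        unsafe { count_matches_avx2(haystack, needle) }
    } else {
        count_matches_scalar(haystack, needle)
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn count_matches(haystack: &[u8], needle: u8) -> usize {
    count_matches_scalar(haystack, needle)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn count_matches_avx2(haystack: &[u8], needle: u8) -> usize {
    // A real body would use core::arch::x86_64 intrinsics; the scalar
    // loop is a stand-in so the sketch stays self-contained.
    count_matches_scalar(haystack, needle)
}

fn count_matches_scalar(haystack: &[u8], needle: u8) -> usize {
    haystack.iter().filter(|&&b| b == needle).count()
}
```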

Now that I have avx2 instructions properly used, here is what I observe:

benchmark                          avx2 bufchr/memchr/prebuilt  avx2-aligned bufchr/memchr/prebuilt  memchr/memchr/prebuilt  sse2 bufchr/memchr/prebuilt
---------                          ---------------------------  -----------------------------------  ----------------------  ---------------------------
memchr/sherlock/common/huge1       4.2 GB/s (1.00x)             4.1 GB/s (1.02x)                     2.2 GB/s (1.90x)        3.4 GB/s (1.24x)
memchr/sherlock/common/small1      11.5 GB/s (1.04x)            8.4 GB/s (1.42x)                     2.4 GB/s (4.87x)        11.9 GB/s (1.00x)
memchr/sherlock/common/tiny1       2.8 GB/s (1.10x)             1827.9 MB/s (1.71x)                  901.4 MB/s (3.48x)      3.1 GB/s (1.00x)
memchr/sherlock/never/huge1        41.4 GB/s (2.84x)            45.2 GB/s (2.60x)                    117.6 GB/s (1.00x)      37.6 GB/s (3.13x)
memchr/sherlock/never/small1       19.9 GB/s (1.48x)            20.6 GB/s (1.43x)                    29.4 GB/s (1.00x)       15.9 GB/s (1.86x)
memchr/sherlock/never/tiny1        3.8 GB/s (1.00x)             2.6 GB/s (1.47x)                     3.6 GB/s (1.06x)        3.8 GB/s (1.00x)
memchr/sherlock/never/empty1       14.00ns (1.00x)              15.00ns (1.07x)                      16.00ns (1.14x)         14.00ns (1.00x)
memchr/sherlock/rare/huge1         37.1 GB/s (2.57x)            41.5 GB/s (2.30x)                    95.5 GB/s (1.00x)       34.1 GB/s (2.80x)
memchr/sherlock/rare/small1        21.3 GB/s (1.12x)            18.2 GB/s (1.31x)                    23.8 GB/s (1.00x)       17.7 GB/s (1.35x)
memchr/sherlock/rare/tiny1         3.8 GB/s (1.00x)             2.6 GB/s (1.47x)                     2.9 GB/s (1.29x)        3.8 GB/s (1.00x)
memchr/sherlock/uncommon/huge1     14.0 GB/s (1.00x)            11.6 GB/s (1.20x)                    7.4 GB/s (1.90x)        11.4 GB/s (1.23x)
memchr/sherlock/uncommon/small1    18.7 GB/s (1.00x)            14.7 GB/s (1.27x)                    9.5 GB/s (1.97x)        16.7 GB/s (1.12x)
memchr/sherlock/uncommon/tiny1     3.6 GB/s (1.00x)             2.5 GB/s (1.44x)                     1495.5 MB/s (2.44x)     3.6 GB/s (1.00x)
memchr/sherlock/verycommon/huge1   3.3 GB/s (1.00x)             2.6 GB/s (1.26x)                     1027.4 MB/s (3.32x)     2.5 GB/s (1.31x)
memchr/sherlock/verycommon/small1  8.0 GB/s (1.07x)             5.2 GB/s (1.64x)                     1064.3 MB/s (8.26x)     8.6 GB/s (1.00x)

Ranking:

Engine                               Version  Geometric mean of speed ratios  Benchmark count
------                               -------  ------------------------------  ---------------
avx2 bufchr/memchr/prebuilt          0.0.1    1.20                            15
sse2 bufchr/memchr/prebuilt          0.0.1    1.30                            15
avx2-aligned bufchr/memchr/prebuilt  0.0.1    1.46                            15
memchr/memchr/prebuilt               2.7.4    1.88                            15

So now avx2 is giving an edge, but not by much. Alignment seems to help a little with large inputs where the needle is rare, but not that much either. Also, memchr is still 2-3 times faster in that particular use-case (huge inputs with rare or no matches), which is a bit odd, because I would not expect loop unrolling to give such an edge.

When searching for 2 needles (using avx2 properly this time), the discrepancy drops:

benchmark                        rust/bufchr/memchr2/prebuilt  rust/memchr/memchr2
---------                        ----------------------------  -------------------
memchr/sherlock/common/huge2     3.1 GB/s (1.00x)              1337.4 MB/s (2.34x)
memchr/sherlock/common/small2    9.4 GB/s (1.00x)              1490.0 MB/s (6.44x)
memchr/sherlock/never/huge2      51.9 GB/s (1.27x)             65.9 GB/s (1.00x)
memchr/sherlock/never/small2     19.3 GB/s (1.10x)             21.3 GB/s (1.00x)
memchr/sherlock/never/tiny2      3.6 GB/s (1.00x)              3.6 GB/s (1.00x)
memchr/sherlock/never/empty2     14.00ns (1.00x)               16.00ns (1.14x)
memchr/sherlock/rare/huge2       42.4 GB/s (1.17x)             49.6 GB/s (1.00x)
memchr/sherlock/rare/small2      18.2 GB/s (1.00x)             15.1 GB/s (1.21x)
memchr/sherlock/rare/tiny2       3.1 GB/s (1.00x)              2.2 GB/s (1.38x)
memchr/sherlock/uncommon/huge2   5.8 GB/s (1.00x)              3.5 GB/s (1.63x)
memchr/sherlock/uncommon/small2  15.9 GB/s (1.00x)             5.8 GB/s (2.72x)
memchr/sherlock/uncommon/tiny2   2.9 GB/s (1.00x)              1061.3 MB/s (2.82x)

Ranking:

Engine                        Version  Geometric mean of speed ratios  Benchmark count
------                        -------  ------------------------------  ---------------
rust/bufchr/memchr2/prebuilt  0.0.1    1.04                            12
rust/memchr/memchr2           2.7.4    1.63                            12

I'll check the 3-needle case afterwards, because this is the use-case I am particularly interested in. Then, after the holidays, I will try integrating a proper "within-memchr" solution in my fork, for an apples-to-apples benchmark with rigorous testing.

@Yomguithereal
Author

Results for the 3-needle case:

benchmark                        rust/bufchr/memchr3/prebuilt  rust/memchr/memchr3
---------                        ----------------------------  -------------------
memchr/sherlock/common/huge3     2.4 GB/s (1.00x)              861.2 MB/s (2.91x)
memchr/sherlock/common/small3    6.6 GB/s (1.00x)              969.7 MB/s (7.02x)
memchr/sherlock/never/huge3      28.8 GB/s (1.93x)             55.6 GB/s (1.00x)
memchr/sherlock/never/small3     15.5 GB/s (1.43x)             22.1 GB/s (1.00x)
memchr/sherlock/never/tiny3      3.4 GB/s (1.00x)              3.4 GB/s (1.00x)
memchr/sherlock/never/empty3     14.00ns (1.00x)               16.00ns (1.14x)
memchr/sherlock/rare/huge3       27.2 GB/s (1.40x)             38.1 GB/s (1.00x)
memchr/sherlock/rare/small3      12.6 GB/s (1.07x)             13.4 GB/s (1.00x)
memchr/sherlock/rare/tiny3       3.2 GB/s (1.00x)              1935.4 MB/s (1.70x)
memchr/sherlock/uncommon/huge3   4.9 GB/s (1.00x)              2.6 GB/s (1.87x)
memchr/sherlock/uncommon/small3  12.9 GB/s (1.00x)             3.0 GB/s (4.25x)
memchr/sherlock/uncommon/tiny3   2.5 GB/s (1.00x)              626.7 MB/s (4.04x)

Here, unrolling seems once more to give memchr an edge on huge inputs in the rare and never categories.

And here is the related ranking:

Engine                        Version  Geometric mean of speed ratios  Benchmark count
------                        -------  ------------------------------  ---------------
rust/bufchr/memchr3/prebuilt  0.0.1    1.13                            12
rust/memchr/memchr3           2.7.4    1.81                            12

@Yomguithereal
Author

Hello @BurntSushi, I have tried to integrate the new iterator logic within memchr itself to provide an apples-to-apples comparison, as you suggested, but I encountered two roadblocks:

  1. To make sure not to break the current API, the new iterator would need to implement DoubleEndedIterator. This means the iterator methods must do some additional bookkeeping to know when they must erase a currently held mask, typically when someone iterates forward, then backward, etc. I am afraid this would increase the iterator's memory usage as well as require another condition at the beginning of the methods.
  2. The new iterator has a more complex (and generic) state to keep track of, and I am unsure how to fit this into the current implementation, which relies on a generic iterator driven only by a provided generic callback. This makes runtime detection quite a hassle to get right for the new iterator.
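The first roadblock can be sketched like this (hypothetical names and field layout, not the crate's actual state): a forward mask cached for a whole block becomes stale as soon as the back cursor moves into that block, so every backward step needs an extra guard to trim or erase it.

```rust
/// Hypothetical iterator state: the forward half caches a match mask
/// covering the block [mask_start, mask_end).
struct State {
    mask: u64,         // cached forward match bits (bit i = mask_start + i)
    mask_start: usize, // inclusive start of the block the mask covers
    mask_end: usize,   // exclusive end of the block the mask covers
    back: usize,       // exclusive back cursor
}

impl State {
    /// The extra bookkeeping a DoubleEndedIterator would need at the start
    /// of next_back(): drop cached match bits at or past the new back
    /// cursor, so forward iteration cannot report positions that the
    /// backward half has already consumed.
    fn shrink_back(&mut self, new_back: usize) {
        self.back = new_back;
        if new_back < self.mask_end {
            let keep = new_back.saturating_sub(self.mask_start);
            self.mask &= (1u64 << keep) - 1; // keep only bits before `back`
            self.mask_end = new_back;
        }
    }
}
```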

Given that the use-cases where this new iterator is supposed to shine are not that common (at least, I only have one case in mind currently, namely CSV parsing, which needs to search for 3 different bytes repeatedly), I am wondering whether this is a good fit for memchr, since we both probably don't have the bandwidth to make this happen.

In the meantime, I have a non-runtime-detected version of the 3-byte searcher over here. I am using it in a custom Rust CSV parser relying on SIMD instructions (the crate is not yet documented but can be found here), as hinted in this other discussion. I am not using the simdjson tricks described here, because I am not clever enough for pclmulqdq and because the memchr routines, as amortized by the new iterator, are good enough to get a significant performance boost. I have multiple versions of a basic CSV reader (which will never cover all the ground of the csv crate): one that delimits the rows, one that splits them in a zero-copy fashion (recording or not the positions of the delimiters), and one that correctly decodes the data. The performance boost (on data big enough for the measurement to be accurate) is between 1.2 and 8 times over linear scanning, which is good enough for my use-cases. Basically, the longer your cells, the greater the boost, as you hinted at earlier, but we don't pay the cost of restarting the memchr routine when the searched bytes are too close to one another.

@BurntSushi
Owner

Thank you for that investigation!

I do think it would be great to get this kind of optimization into this crate somehow, but it does seem like quite the challenge unfortunately.

@Yomguithereal
Author

Thinking a little more about this, the only practical paradigm I can think of to implement this iteration logic while still performing runtime detection of avx2 is callback iteration. Something like for_each<F: FnMut>.
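A minimal sketch of that shape (hypothetical names, scalar stand-in body): the feature check happens once per call rather than once per item, and the entire amortized loop runs inside a single #[target_feature] function that hands each match offset to the callback.

```rust
/// Hypothetical public entry point: detect CPU features once, then run
/// the whole loop inside the selected function.
fn for_each_match<F: FnMut(usize)>(haystack: &[u8], needle: u8, mut f: F) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just verified at runtime.
            unsafe { for_each_match_avx2(haystack, needle, &mut f) };
            return;
        }
    }
    for_each_match_scalar(haystack, needle, &mut f);
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn for_each_match_avx2(haystack: &[u8], needle: u8, f: &mut dyn FnMut(usize)) {
    // A real implementation would keep the amortized AVX2 loop here; the
    // scalar stand-in keeps the sketch self-contained.
    for_each_match_scalar(haystack, needle, f);
}

fn for_each_match_scalar(haystack: &[u8], needle: u8, f: &mut dyn FnMut(usize)) {
    for (i, &b) in haystack.iter().enumerate() {
        if b == needle {
            f(i);
        }
    }
}
```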
