@adamreichold

This does not appear to make a difference in the rust/memchr/memchr(2|3)$ benchmarks running on x86-64 with AVX2. I would argue that it does make for simpler code though.

I suspected there was a reason for this structure and dug through the Git history, but could not find anything. (I ended up at the big "rewrite everything" commit.) At least on x86-64, I would expect fewer "movemask" instructions to be preferable due to their high latency?

@adamreichold
Author

Sorry for forgetting to run the tests before pushing. I obviously misunderstood something...

@adamreichold
Author

Ok, mixed up the order in the reverse find routine. Will rerun the benchmarks to see if anything changed now...

Still appears to be within the noise present on my system:

benchmark                        engine               before.csv           after.csv
---------                        ------               ----------           ---------
memchr/sherlock/common/huge2     rust/memchr/memchr2  1634.8 MB/s (1.00x)  1632.5 MB/s (1.00x)
memchr/sherlock/common/huge3     rust/memchr/memchr3  1180.1 MB/s (1.00x)  1107.0 MB/s (1.07x)
memchr/sherlock/common/small2    rust/memchr/memchr2  1706.8 MB/s (1.00x)  1666.4 MB/s (1.02x)
memchr/sherlock/common/small3    rust/memchr/memchr3  1239.2 MB/s (1.00x)  1168.3 MB/s (1.06x)
memchr/sherlock/never/huge2      rust/memchr/memchr2  85.4 GB/s (1.03x)    88.1 GB/s (1.00x)
memchr/sherlock/never/huge3      rust/memchr/memchr3  73.5 GB/s (1.00x)    72.7 GB/s (1.01x)
memchr/sherlock/never/small2     rust/memchr/memchr2  20.6 GB/s (1.00x)    20.6 GB/s (1.00x)
memchr/sherlock/never/small3     rust/memchr/memchr3  20.6 GB/s (1.00x)    20.6 GB/s (1.00x)
memchr/sherlock/never/tiny2      rust/memchr/memchr2  3.2 GB/s (1.00x)     3.2 GB/s (1.00x)
memchr/sherlock/never/tiny3      rust/memchr/memchr3  3.2 GB/s (1.00x)     3.2 GB/s (1.00x)
memchr/sherlock/never/empty2     rust/memchr/memchr2  20.00ns (1.00x)      20.00ns (1.00x)
memchr/sherlock/never/empty3     rust/memchr/memchr3  20.00ns (1.00x)      20.00ns (1.00x)
memchr/sherlock/rare/huge2       rust/memchr/memchr2  59.1 GB/s (1.00x)    58.3 GB/s (1.01x)
memchr/sherlock/rare/huge3       rust/memchr/memchr3  49.4 GB/s (1.01x)    49.8 GB/s (1.00x)
memchr/sherlock/rare/small2      rust/memchr/memchr2  20.6 GB/s (1.00x)    20.6 GB/s (1.00x)
memchr/sherlock/rare/small3      rust/memchr/memchr3  15.5 GB/s (1.00x)    15.5 GB/s (1.00x)
memchr/sherlock/rare/tiny2       rust/memchr/memchr2  2.1 GB/s (1.00x)     2.1 GB/s (1.00x)
memchr/sherlock/rare/tiny3       rust/memchr/memchr3  2.1 GB/s (1.00x)     2.1 GB/s (1.00x)
memchr/sherlock/uncommon/huge2   rust/memchr/memchr2  4.4 GB/s (1.02x)     4.5 GB/s (1.00x)
memchr/sherlock/uncommon/huge3   rust/memchr/memchr3  3.1 GB/s (1.01x)     3.1 GB/s (1.00x)
memchr/sherlock/uncommon/small2  rust/memchr/memchr2  6.9 GB/s (1.00x)     6.9 GB/s (1.00x)
memchr/sherlock/uncommon/small3  rust/memchr/memchr3  4.8 GB/s (1.00x)     4.8 GB/s (1.00x)
memchr/sherlock/uncommon/tiny2   rust/memchr/memchr2  1096.7 MB/s (1.00x)  1096.7 MB/s (1.00x)
memchr/sherlock/uncommon/tiny3   rust/memchr/memchr3  822.5 MB/s (1.00x)   822.5 MB/s (1.00x)

@BurntSushi
Owner

BurntSushi commented Oct 17, 2025

I think I had fiddled with this when writing this initially. And got similar results. And that's why I didn't do this, IIRC.

This generally doesn't have an impact for never/rare/uncommon benchmarks because the redundant movemasks only appear when a match occurs. And in those benchmarks, matches are rare.

But when matches are common, this change seems to regress runtime. A 1.07x regression is enough to give me pause here, particularly on a huge haystack where noise doesn't tend to be as much of a factor.
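To make that concrete, here is a rough scalar model of the 2x-unrolled search loop, with plain [u8; 32] arrays standing in for AVX2 vectors. The helper names (eq_mask, or_mask, movemask, find2) are made up for illustration and are not the crate's actual API. The hot loop pays for exactly one movemask; the extra movemasks that locate the hit run only in the cold path, which is why the never/rare/uncommon benchmarks barely exercise them:

```rust
// Scalar stand-in for the unrolled AVX2 loop: a "vector" is a [u8; 32]
// and movemask packs each lane's high bit into a u32.
fn eq_mask(v: &[u8; 32], needle: u8) -> [u8; 32] {
    let mut m = [0u8; 32];
    for i in 0..32 {
        m[i] = if v[i] == needle { 0xFF } else { 0x00 };
    }
    m
}

fn or_mask(a: &[u8; 32], b: &[u8; 32]) -> [u8; 32] {
    let mut m = [0u8; 32];
    for i in 0..32 {
        m[i] = a[i] | b[i];
    }
    m
}

fn movemask(m: &[u8; 32]) -> u32 {
    m.iter()
        .enumerate()
        .fold(0, |acc, (i, &b)| acc | (((b >> 7) as u32) << i))
}

fn find2(haystack: &[u8], n1: u8, n2: u8) -> Option<usize> {
    for (i, chunk) in haystack.chunks_exact(64).enumerate() {
        let a: [u8; 32] = chunk[..32].try_into().unwrap();
        let b: [u8; 32] = chunk[32..].try_into().unwrap();
        // Four comparisons per iteration, as in the 2x-unrolled loop.
        let a1 = eq_mask(&a, n1);
        let a2 = eq_mask(&a, n2);
        let b1 = eq_mask(&b, n1);
        let b2 = eq_mask(&b, n2);
        // Hot loop: a single movemask over everything ORed together.
        let any = or_mask(&or_mask(&a1, &a2), &or_mask(&b1, &b2));
        if movemask(&any) != 0 {
            // Cold path, reached only when a match occurred: the extra
            // movemasks that pinpoint the hit run here, so benchmarks
            // with few matches are insensitive to how this is done.
            let first = movemask(&or_mask(&a1, &a2));
            if first != 0 {
                return Some(i * 64 + first.trailing_zeros() as usize);
            }
            let second = movemask(&or_mask(&b1, &b2));
            return Some(i * 64 + 32 + second.trailing_zeros() as usize);
        }
    }
    None // tail handling omitted for brevity
}

fn main() {
    let mut hay = [b'x'; 64];
    hay[40] = b'q';
    assert_eq!(find2(&hay, b'q', b'z'), Some(40));
    assert_eq!(find2(&hay, b'z', b'Q'), None);
}
```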

@adamreichold
Author

adamreichold commented Oct 17, 2025

A 1.07x regression is enough to give me pause here, particularly on a huge haystack where noise doesn't tend to be as much of a factor.

I would not put too much trust in that particular number, or rather in the system it was measured on: a notebook running amd_pstate in active mode. I saw similar swings in both directions intermittently, which is why I currently consider it noise and would certainly want to remeasure on a quieter system (or at least on this system with a simpler cpufreq driver) before committing to anything.

But when matches are common, this change seems to regress runtime.

So I think why I am here is basically that I do not understand how doing more work (extra ORs and extra movemasks) can result in faster code, especially since I do not see additional register dependencies appearing either.

@BurntSushi
Owner

All righty. I'll try to run these when I get a chance. I have machines on my LAN specifically dedicated to benchmarking that are quiet.

But when matches are common, this change seems to regress runtime.

So I think why I am here is basically that I do not understand how doing more work (extra ORs and extra movemasks) can result in faster code, especially since I do not see additional register dependencies appearing either.

I dunno. The assembly generated might provide a clue. And if the code generated reflects the elimination of the redundant movemasks and it's still slower, then we might be looking at CPU effects that are beyond my comprehension.

@adamreichold
Author

adamreichold commented Oct 17, 2025

The assembly generated might provide a clue.

Looking at the relevant basic blocks for memchr::arch::x86_64::memchr::memchr2_raw::find_avx2 on master:

.LBB12_9:
        vmovdqa ymm3, ymmword ptr [rdx]
        vmovdqa ymm6, ymmword ptr [rdx + 32]
        vpcmpeqb ymm4, ymm0, ymm3
        vpcmpeqb ymm5, ymm1, ymm3
        vpcmpeqb ymm3, ymm1, ymm6
        vpcmpeqb ymm2, ymm0, ymm6
        vpor ymm6, ymm4, ymm2
        vpor ymm7, ymm3, ymm5
        vpor ymm6, ymm7, ymm6
        vpmovmskb esi, ymm6
        test esi, esi
        jne .LBB12_28
        add rdx, 64
        cmp rdx, rax
        jbe .LBB12_9
.LBB12_28:
        vpor ymm0, ymm4, ymm5
        vpmovmskb eax, ymm0
        test eax, eax
        jne .LBB12_17
        vpor ymm0, ymm2, ymm3
        vpmovmskb eax, ymm0
        tzcnt eax, eax
        lea rdx, [rdx + rax + 32]
        jmp .LBB12_19

and with this change:

.LBB12_9:
        vmovdqa ymm2, ymmword ptr [rdx]
        vmovdqa ymm4, ymmword ptr [rdx + 32]
        vpcmpeqb ymm3, ymm0, ymm2
        vpcmpeqb ymm2, ymm1, ymm2
        vpcmpeqb ymm5, ymm0, ymm4
        vpor ymm3, ymm2, ymm3
        vpcmpeqb ymm2, ymm1, ymm4
        vpor ymm2, ymm2, ymm5
        vpor ymm4, ymm3, ymm2
        vpmovmskb esi, ymm4
        test esi, esi
        jne .LBB12_28
        add rdx, 64
        cmp rdx, rax
        jbe .LBB12_9
.LBB12_28:
        vpmovmskb eax, ymm3
        test eax, eax
        jne .LBB12_17
        vpmovmskb eax, ymm2
        tzcnt eax, eax
        lea rdx, [rdx + rax + 32]
        jmp .LBB12_19

So the compiler already lifts (..).movemask().or((..).movemask()) into (..).or(..).movemask(), i.e. it ORs the vector masks instead of ORing the bitmasks. I would interpret this to mean that the modified code is actually closer to the generated assembly.
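That lifting is valid because each vpcmpeqb lane is either all ones or all zeros, so ORing the vector masks before a single movemask yields the same bitmask as ORing two movemask results. A quick scalar check of the identity, with plain arrays standing in for the real _mm256_* intrinsics:

```rust
// Scalar model: a 32-lane byte mask whose lanes are 0x00 or 0xFF, and a
// movemask that packs each lane's high bit into a u32 (like vpmovmskb).
fn movemask(m: &[u8; 32]) -> u32 {
    m.iter()
        .enumerate()
        .fold(0, |acc, (i, &b)| acc | (((b >> 7) as u32) << i))
}

fn main() {
    // Two arbitrary comparison results, as vpcmpeqb would produce them.
    let mut a = [0u8; 32];
    a[3] = 0xFF;
    a[17] = 0xFF;
    let mut b = [0u8; 32];
    b[3] = 0xFF;
    b[9] = 0xFF;
    let mut ab = [0u8; 32];
    for i in 0..32 {
        ab[i] = a[i] | b[i];
    }
    // One vpor + one vpmovmskb gives the same bits as two vpmovmskb + a
    // scalar OR.
    assert_eq!(movemask(&ab), movemask(&a) | movemask(&b));
}
```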

Furthermore, the modified code does indeed contain two fewer vpor instructions, and it thereby references two fewer registers (ymm2/ymm3 instead of ymm2/ymm3/ymm4/ymm5) in the final basic block .LBB12_28.

The modified code interleaves the vpcmpeqb and vpor instructions and reuses ymm2 for the fourth vpcmpeqb. This might add a spurious dependency between the comparisons, but shouldn't the CPU's register renaming break that?

To me, this actually makes it more mysterious. I wonder whether one could add a compiler barrier or something to avoid this reordering...

@adamreichold
Author

One could unroll that loop one more time while staying within the same (vector) register budget, but the benchmarks are similarly inconclusive: some wins, some losses.
