Avoid redundant mask computations in the generic Two and Three implementations. #193
Conversation
Sorry for forgetting to run the tests before pushing. I obviously misunderstood something...
Force-pushed 83e4e82 to f60191f.
OK, mixed up the order in the reverse find routine. Will rerun the benchmarks to see if anything changed now...
Still appears to be within the noise present on my system.
I think I had fiddled with this when writing this initially. And got similar results. And that's why I didn't do this, IIRC. This generally doesn't have an impact for the never/rare/uncommon benchmarks, because the redundant movemasks only appear when a match occurs, and in those benchmarks matches are rare. But when matches are common, this change seems to regress runtime. A 1.07x regression is enough to give me pause here, particularly on a huge haystack where noise doesn't tend to be as much of a factor.
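To make the two shapes being compared concrete, here is a minimal sketch written with raw AVX2 intrinsics rather than the crate's generic vector abstraction; the function names and the single-chunk simplification are illustrative, not the actual memchr source. The existing shape recomputes per-comparison masks on the match path and ORs them as scalars, while the proposed shape ORs the comparison vectors and movemasks once.

```rust
// Illustrative sketch only (not the memchr crate's actual code): the match
// path for "find either of two needle bytes" over a single 32-byte chunk.
#[cfg(target_arch = "x86_64")]
mod two_sketch {
    use std::arch::x86_64::*;

    /// Existing shape: the match path redoes the masks, OR-ing two scalar
    /// movemasks. The extra `vpmovmskb` only execute when a match occurs.
    #[target_feature(enable = "avx2")]
    pub unsafe fn find_two_current(chunk: __m256i, n1: __m256i, n2: __m256i) -> Option<u32> {
        let eq1 = _mm256_cmpeq_epi8(n1, chunk);
        let eq2 = _mm256_cmpeq_epi8(n2, chunk);
        let any = _mm256_or_si256(eq1, eq2);
        if _mm256_movemask_epi8(any) != 0 {
            // Redundant work: two more movemasks, combined as scalars.
            let mask = (_mm256_movemask_epi8(eq1) as u32) | (_mm256_movemask_epi8(eq2) as u32);
            return Some(mask.trailing_zeros());
        }
        None
    }

    /// Proposed shape: movemask the OR'd comparison vector once and reuse it.
    #[target_feature(enable = "avx2")]
    pub unsafe fn find_two_proposed(chunk: __m256i, n1: __m256i, n2: __m256i) -> Option<u32> {
        let eq1 = _mm256_cmpeq_epi8(n1, chunk);
        let eq2 = _mm256_cmpeq_epi8(n2, chunk);
        let mask = _mm256_movemask_epi8(_mm256_or_si256(eq1, eq2)) as u32;
        if mask != 0 {
            return Some(mask.trailing_zeros());
        }
        None
    }
}
```

The real loop processes two chunks per iteration, but the trade-off is the same: the extra movemasks only run when a match is present, which is why the never/rare benchmarks are unaffected.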
I would not put too much trust in that particular number, or rather in the system it was measured on: a notebook.
So the reason I am here is basically that I do not understand how doing more work (extra `or`s and extra `movemask`s) can result in faster code, especially since I do not see additional register dependencies appearing either.
All righty. I'll try to run these when I get a chance. I have machines on my LAN specifically dedicated to benchmarking that are quiet.
I dunno. The assembly generated might provide a clue. And if the code generated reflects the elimination of the redundant movemasks and it's still slower, then we might be looking at CPU effects that are beyond my comprehension.
Looking at the relevant basic blocks for the current code:

```asm
.LBB12_9:
        vmovdqa ymm3, ymmword ptr [rdx]
        vmovdqa ymm6, ymmword ptr [rdx + 32]
        vpcmpeqb ymm4, ymm0, ymm3
        vpcmpeqb ymm5, ymm1, ymm3
        vpcmpeqb ymm3, ymm1, ymm6
        vpcmpeqb ymm2, ymm0, ymm6
        vpor ymm6, ymm4, ymm2
        vpor ymm7, ymm3, ymm5
        vpor ymm6, ymm7, ymm6
        vpmovmskb esi, ymm6
        test esi, esi
        jne .LBB12_28
        add rdx, 64
        cmp rdx, rax
        jbe .LBB12_9
.LBB12_28:
        vpor ymm0, ymm4, ymm5
        vpmovmskb eax, ymm0
        test eax, eax
        jne .LBB12_17
        vpor ymm0, ymm2, ymm3
        vpmovmskb eax, ymm0
        tzcnt eax, eax
        lea rdx, [rdx + rax + 32]
        jmp .LBB12_19
```

and with this change:

```asm
.LBB12_9:
        vmovdqa ymm2, ymmword ptr [rdx]
        vmovdqa ymm4, ymmword ptr [rdx + 32]
        vpcmpeqb ymm3, ymm0, ymm2
        vpcmpeqb ymm2, ymm1, ymm2
        vpcmpeqb ymm5, ymm0, ymm4
        vpor ymm3, ymm2, ymm3
        vpcmpeqb ymm2, ymm1, ymm4
        vpor ymm2, ymm2, ymm5
        vpor ymm4, ymm3, ymm2
        vpmovmskb esi, ymm4
        test esi, esi
        jne .LBB12_28
        add rdx, 64
        cmp rdx, rax
        jbe .LBB12_9
.LBB12_28:
        vpmovmskb eax, ymm3
        test eax, eax
        jne .LBB12_17
        vpmovmskb eax, ymm2
        tzcnt eax, eax
        lea rdx, [rdx + rax + 32]
        jmp .LBB12_19
```

So the compiler already lifts those redundant `movemask`s on its own: even without this change, it ORs the comparison vectors and emits a single `vpmovmskb` per chunk on the match path. Furthermore, the modified code does indeed have two fewer `vpor` instructions. The modified code sees interleaving of the `vpcmpeqb` and `vpor` instructions in the hot loop, though. To me, this actually makes it more mysterious. I wonder whether one could add a compiler barrier or something to avoid this reordering...
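For what it's worth, one blunt way to experiment with pinning that schedule is `std::hint::black_box`, which makes a value opaque to the optimizer. This is a hypothetical sketch, not something the crate does; `black_box` is only a hint, not a precise scheduling fence, and it may pessimize codegen in other ways.

```rust
// Hypothetical experiment only: pin the comparison results behind black_box
// so the optimizer cannot fold them into, or freely rearrange them with,
// the later ORs. Guarantees nothing about the final instruction schedule.
#[cfg(target_arch = "x86_64")]
mod barrier_sketch {
    use std::arch::x86_64::*;
    use std::hint::black_box;

    #[target_feature(enable = "avx2")]
    pub unsafe fn combined_mask(chunk: __m256i, n1: __m256i, n2: __m256i) -> u32 {
        // Each compare result is forced to be treated as an opaque value,
        // which tends to keep the compares grouped ahead of the ORs.
        let eq1 = black_box(_mm256_cmpeq_epi8(n1, chunk));
        let eq2 = black_box(_mm256_cmpeq_epi8(n2, chunk));
        _mm256_movemask_epi8(_mm256_or_si256(eq1, eq2)) as u32
    }
}
```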
One could unroll that loop one more time while staying within the same (vector) register budget. Benchmarks are similarly inconclusive though. Some wins, some losses.
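As a rough sketch of what unrolling once more could look like (illustrative only: the names, the 128-byte step, and the omitted match and tail handling are not the crate's code), the hot check for the Two searcher could be widened to four 32-byte chunks per iteration while staying well within the sixteen available ymm registers.

```rust
// Illustrative only: four chunks per iteration, one combined movemask.
#[cfg(target_arch = "x86_64")]
mod unroll_sketch {
    use std::arch::x86_64::*;

    /// Returns true if either needle byte occurs in the 128 bytes at `ptr`.
    /// Caller must guarantee at least 128 readable bytes.
    #[target_feature(enable = "avx2")]
    pub unsafe fn any_match_128(ptr: *const u8, n1: __m256i, n2: __m256i) -> bool {
        let a = _mm256_loadu_si256(ptr.cast());
        let b = _mm256_loadu_si256(ptr.add(32).cast());
        let c = _mm256_loadu_si256(ptr.add(64).cast());
        let d = _mm256_loadu_si256(ptr.add(96).cast());

        // One OR'd comparison vector per chunk...
        let ea = _mm256_or_si256(_mm256_cmpeq_epi8(n1, a), _mm256_cmpeq_epi8(n2, a));
        let eb = _mm256_or_si256(_mm256_cmpeq_epi8(n1, b), _mm256_cmpeq_epi8(n2, b));
        let ec = _mm256_or_si256(_mm256_cmpeq_epi8(n1, c), _mm256_cmpeq_epi8(n2, c));
        let ed = _mm256_or_si256(_mm256_cmpeq_epi8(n1, d), _mm256_cmpeq_epi8(n2, d));

        // ...then a single movemask over the OR of all four chunks.
        let any = _mm256_or_si256(_mm256_or_si256(ea, eb), _mm256_or_si256(ec, ed));
        _mm256_movemask_epi8(any) != 0
    }
}
```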
This does not appear to make a difference in the `rust/memchr/memchr(2|3)$` benchmarks running on x86-64 with AVX2. I would argue that it does make for simpler code, though. I suspect there is a reason for this structure and did dig through the Git history, but could not find anything. (I ended up at the big "rewrite everything" commit.) At least on x86-64, I would expect fewer `movemask` instructions to be preferable due to their high latency?