
Poor optimization of switch statement in Clang 19.1.0 compared to Clang 18.1.0 #127365

Open
inicula opened this issue Feb 16, 2025 · 6 comments

@inicula

inicula commented Feb 16, 2025

I was running some tests/benchmarks comparing switch statements with array look-ups and found this change in behavior between Clang 18.1.0 and Clang 19.1.0 (and current trunk):
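A hypothetical reduction of the kind of code being tested (the actual source is on the Godbolt link below; the function names and case values here are invented): a dense switch mapping a byte to a buffer offset, which the caller then increments.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reduction of the benchmarked pattern: a dense switch
// mapping an input byte to a counter offset. Clang 18 lowers a switch
// like this to a constant lookup table.
std::size_t byte_to_offset(std::uint8_t b) {
    switch (b) {
    case 0: return 5;
    case 1: return 17;
    case 2: return 3;
    // ... one case per possible input byte ...
    default: return 0;
    }
}

// Caller: bump the counter at the selected offset for every input byte.
void count_bytes(const std::uint8_t* input, std::size_t n,
                 std::uint8_t* counters) {
    for (std::size_t i = 0; i < n; ++i)
        ++counters[byte_to_offset(input[i])];
}
```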

Clang 18.1.0 optimizes the big switch into a constant lookup table:

        lea     rcx, [rip + .Lswitch.table.main]
.LBB0_12:
        movzx   edx, byte ptr [rbx + rax - 3]
        xor     edx, 128
        mov     rdx, qword ptr [rcx + 8*rdx]
        inc     byte ptr [rsp + rdx + 112]

On the other hand, Clang 19.1.0 generates a separate label for each switch case, each of which jumps to a common merge block:

.LBB0_28:
        lea     rcx, [rsp + 665] ; case 1: return 553;
        jmp     .LBB0_283
.LBB0_29:
        lea     rcx, [rsp + 653] ; case 2: return 541;
        jmp     .LBB0_283
; ..............................
.LBB0_283:
        inc     byte ptr [rcx]
        inc     rbp
        cmp     rbp, 300000000
        je      .LBB0_20
        movzx   ecx, byte ptr [rbx + rbp]
        movsxd  rdx, dword ptr [rax + 4*rcx]
        add     rdx, rax
        mov     rcx, r13
        jmp     rdx

This can tank performance, for example when the branch predictor can't accurately predict which label the current iteration will jump to. In my example I'm generating random indices, and with perf stat I see almost 300 million branch misses (one per increment() invocation).

Assuming this change isn't an intentional trade-off made for a benefit in some other use cases, this is a regression.

On my machine, running that binary (same source code as the one in Godbolt) compiled with Clang 18 vs. Clang 19 gives:

clang 18 = elapsed: 254ms sum: 28928
clang 19 = elapsed: 2813ms sum: 29184

So the binary generated by Clang 19 is about 11 times slower.

NOTE: The issue seems to be related to inlining: if I add __attribute__((noinline)) to the increment() function, Clang 19 optimizes it with a lookup table just like Clang 18, and the result is much faster than with inlining allowed:

elapsed: 548ms sum: 28928
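A sketch of the workaround described in the note (only the shape is known from the report; the full case list is invented, with the two offsets taken from the assembly comments above): keeping increment() out of line leaves the switch in its own function, where Clang 19 still lowers it to a lookup table instead of unfolding it into the caller.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical shape of the noinline workaround. The attribute prevents
// inlining, so SROA in the caller never sees (and never unfolds) the
// switch, and it keeps getting lowered to a lookup table.
__attribute__((noinline))
std::size_t increment(std::uint8_t b) {
    switch (b) {
    case 1: return 553; // offsets as seen in the Clang 19 assembly above
    case 2: return 541;
    // ...
    default: return 0;
    }
}
```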
@llvmbot added the clang label (Clang issues not falling into any other category) on Feb 16, 2025
@EugeneZelenko added the llvm:optimizations label and removed the clang label on Feb 16, 2025
@inicula
Author

inicula commented Feb 16, 2025

I added a more minimal reproduction (without the benchmarking boilerplate) in the issue description.

@u4f3

u4f3 commented Feb 16, 2025

https://godbolt.org/z/9n4hGc1cs
Relevant PR in SROA: aeubanks:sroa-phi2
This looks like a tradeoff between SROA and SimplifyCFG.

@nikic
Contributor

nikic commented Feb 16, 2025

Ideally we'd prevent unfolding for cases which can't be SROAd anyway, but I think the most straightforward fix for this would probably be to insert the GEPs into the phi predecessor blocks instead of the entry block. In that case we should be able to sink them down again in SimplifyCFG.

@u4f3

u4f3 commented Feb 18, 2025

I'd like to work on this, but I'm not sure where the best place for the sink part is. I'm also not sure whether transforming phi((gep ptr, idx1), (gep ptr, idx2)) to gep ptr, phi(idx1, idx2) is universally beneficial when nothing gets SROA'd. If it is, maybe we should not put it in SimplifyCFG; otherwise, we should sink them somewhere near switchToLookupTable in SimplifyCFG.
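As a C++-level analogue of the two IR shapes under discussion (a hypothetical illustration, not the actual transform code), both forms compute the same value; the difference is whether a pointer or an index is merged at the join point:

```cpp
#include <cstddef>

// Unfolded form, phi((gep ptr, idx1), (gep ptr, idx2)): one address is
// computed per predecessor and the addresses are merged as a pointer.
int load_unfolded(bool c, const int* arr, std::size_t i1, std::size_t i2) {
    const int* p = c ? &arr[i1] : &arr[i2];
    return *p;
}

// Folded form, gep ptr, phi(idx1, idx2): the index is merged first, and
// a single gep then computes the address.
int load_folded(bool c, const int* arr, std::size_t i1, std::size_t i2) {
    std::size_t idx = c ? i1 : i2;
    return arr[idx];
}
```

The folded form is what SimplifyCFG's switchToLookupTable can turn into a constant table; the unfolded form leaves one address computation per case.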

@nikic
Contributor

nikic commented Feb 18, 2025

> [...] but I think the most straightforward fix for this would probably be to insert the GEPs into the phi predecessor blocks instead of the entry block. In that case we should be able to sink them down again in SimplifyCFG.

Done in #127652, but this didn't work out because LICM still ends up hoisting the GEPs before SimplifyCFG can sink them.

> I'd like to work on this, but I'm not sure where the best place for the sink part is. I'm also not sure whether transforming phi((gep ptr, idx1), (gep ptr, idx2)) to gep ptr, phi(idx1, idx2) is universally beneficial when nothing gets SROA'd. If it is, maybe we should not put it in SimplifyCFG; otherwise, we should sink them somewhere near switchToLookupTable in SimplifyCFG.

SimplifyCFG already does sinking, but it expects the instructions to be in the predecessor blocks. It won't try to sink instructions from other blocks. You could try allowing that for instructions that aren't used anywhere except by the phi node.

The other way to fix this is to prevent SROA from making the transform. The select/phi unfolding is something of a hack, because it's done unconditionally. SROA also has handling to speculate accesses of selects/phis (see e.g. speculatePHINodeLoads). I believe the gep/select unfolding is only ever actually useful in conjunction with those, so it would make sense to extend them to also support an intermediate gep. This would ensure we only perform the transform when it enables SROA. Doing this is probably fairly tricky though.
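A C++-level analogue (hypothetical illustration, not the actual IR code) of the kind of speculation speculatePHINodeLoads performs: instead of loading through a merged pointer, the loads are hoisted into each predecessor and the loaded values are merged, so no pointer phi survives for SROA to worry about:

```cpp
// Before speculation: a load through a phi'd pointer.
int load_through_phi(bool c, const int* p1, const int* p2) {
    const int* p = c ? p1 : p2; // phi(p1, p2)
    return *p;                  // load of the phi
}

// After speculation: one load per predecessor, values merged instead.
int speculated_loads(bool c, const int* p1, const int* p2) {
    int v1 = *p1;       // load in predecessor 1
    int v2 = *p2;       // load in predecessor 2
    return c ? v1 : v2; // phi(v1, v2)
}
```

Note that speculating both loads is only valid when both pointers are known to be safely dereferenceable, which is part of what makes extending this handling to an intermediate gep tricky.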

@sharkautarch

@nikic What if you just changed the currently-unconditional select/phi unfolding code in SROA to prevent it from unfolding GEPs of phis that have more than some arbitrarily chosen number of (unique) incoming values originating from the same switch instruction? Or prevent it from unfolding GEPs of phis with more than a certain number of incoming values entirely.
This might not be the best long-term solution, but maybe (short of reverting PR #83494 entirely) it's the easiest (or most straightforward) short-term fix?
