
Poor optimization of switch statement in Clang 19.1.0 compared to Clang 18.1.0 #127365

Open
inicula opened this issue Feb 16, 2025 · 6 comments

@inicula

inicula commented Feb 16, 2025

I was running some tests/benchmarks comparing switch statements with array look-ups and found this change in behavior between Clang 18.1.0 and Clang 19.1.0 (and current trunk):
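A hypothetical reduction of the kind of code being tested (the actual source is on the Godbolt link below; the function names and case values here are invented): a dense switch mapping a byte to a buffer offset, which the caller then increments.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reduction of the benchmarked pattern: a dense switch
// mapping an input byte to a counter offset. Clang 18 lowers a switch
// like this to a constant lookup table.
std::size_t byte_to_offset(std::uint8_t b) {
    switch (b) {
    case 0: return 5;
    case 1: return 17;
    case 2: return 3;
    // ... one case per possible input byte ...
    default: return 0;
    }
}

// Caller: bump the counter at the selected offset for every input byte.
void count_bytes(const std::uint8_t* input, std::size_t n,
                 std::uint8_t* counters) {
    for (std::size_t i = 0; i < n; ++i)
        ++counters[byte_to_offset(input[i])];
}
```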

Clang 18.1.0 optimizes the big switch into a constant lookup table:

        lea     rcx, [rip + .Lswitch.table.main]
.LBB0_12:
        movzx   edx, byte ptr [rbx + rax - 3]
        xor     edx, 128
        mov     rdx, qword ptr [rcx + 8*rdx]
        inc     byte ptr [rsp + rdx + 112]

On the other hand, Clang 19.1.0 generates a separate label for each switch case, each of which jumps to a common merge block:

.LBB0_28:
        lea     rcx, [rsp + 665] ; case 1: return 553;
        jmp     .LBB0_283
.LBB0_29:
        lea     rcx, [rsp + 653] ; case 2: return 541;
        jmp     .LBB0_283
; ..............................
.LBB0_283:
        inc     byte ptr [rcx]
        inc     rbp
        cmp     rbp, 300000000
        je      .LBB0_20
        movzx   ecx, byte ptr [rbx + rbp]
        movsxd  rdx, dword ptr [rax + 4*rcx]
        add     rdx, rax
        mov     rcx, r13
        jmp     rdx

This can tank performance, for example when the branch predictor can't accurately predict which label the current iteration will jump to. In my example I'm generating random indices, and with perf stat I see almost 300 million branch misses (one per increment() invocation).

Assuming this change isn't an intentional trade-off made for a benefit in some other use cases, this is a regression.

On my machine, running that binary (same source code as the one in Godbolt) compiled with Clang 18 vs. Clang 19 gives:

clang 18 = elapsed: 254ms sum: 28928
clang 19 = elapsed: 2813ms sum: 29184

So the binary generated by Clang 19 is about 11 times slower.

NOTE: The issue seems to be related to inlining: if I add __attribute__((noinline)) to the increment() function, Clang 19 optimizes it with a lookup table just like Clang 18, and the result is much faster than with inlining allowed:

elapsed: 548ms sum: 28928
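A sketch of the workaround described in the note (only the shape is known from the report; the full case list is invented, with the two offsets taken from the assembly comments above): keeping increment() out of line leaves the switch in its own function, where Clang 19 still lowers it to a lookup table instead of unfolding it into the caller.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical shape of the noinline workaround. The attribute prevents
// inlining, so SROA in the caller never sees (and never unfolds) the
// switch, and it keeps getting lowered to a lookup table.
__attribute__((noinline))
std::size_t increment(std::uint8_t b) {
    switch (b) {
    case 1: return 553; // offsets as seen in the Clang 19 assembly above
    case 2: return 541;
    // ...
    default: return 0;
    }
}
```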
@llvmbot added the clang label (Clang issues not falling into any other category) on Feb 16, 2025
@EugeneZelenko added the llvm:optimizations label and removed the clang label on Feb 16, 2025
@inicula
Author

inicula commented Feb 16, 2025

I added a more minimal reproduction (without the benchmarking boilerplate) in the issue description.

@u4f3

u4f3 commented Feb 16, 2025

https://godbolt.org/z/9n4hGc1cs
Relevant PR in SROA: aeubanks:sroa-phi2
This looks like a tradeoff between SROA and SimplifyCFG.

@nikic
Contributor

nikic commented Feb 16, 2025

Ideally we'd prevent unfolding for cases which can't be SROAd anyway, but I think the most straightforward fix for this would probably be to insert the GEPs into the phi predecessor blocks instead of the entry block. In that case we should be able to sink them down again in SimplifyCFG.

@u4f3

u4f3 commented Feb 18, 2025

I'd like to work on this, but I'm not sure where the best place for the sink part is. I'm also not sure whether transforming phi((gep ptr, idx1), (gep ptr, idx2)) to gep ptr, phi(idx1, idx2) is universally beneficial when nothing gets SROA'd. If it is, maybe we should not put it in SimplifyCFG; otherwise, we should sink them somewhere near switchToLookupTable in SimplifyCFG.
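As a C++-level analogue of the two IR shapes under discussion (a hypothetical illustration, not the actual transform code), both forms compute the same value; the difference is whether a pointer or an index is merged at the join point:

```cpp
#include <cstddef>

// Unfolded form, phi((gep ptr, idx1), (gep ptr, idx2)): one address is
// computed per predecessor and the addresses are merged as a pointer.
int load_unfolded(bool c, const int* arr, std::size_t i1, std::size_t i2) {
    const int* p = c ? &arr[i1] : &arr[i2];
    return *p;
}

// Folded form, gep ptr, phi(idx1, idx2): the index is merged first, and
// a single gep then computes the address.
int load_folded(bool c, const int* arr, std::size_t i1, std::size_t i2) {
    std::size_t idx = c ? i1 : i2;
    return arr[idx];
}
```

The folded form is what SimplifyCFG's switchToLookupTable can turn into a constant table; the unfolded form leaves one address computation per case.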

@nikic
Contributor

nikic commented Feb 18, 2025

> [...] but I think the most straightforward fix for this would probably be to insert the GEPs into the phi predecessor blocks instead of the entry block. In that case we should be able to sink them down again in SimplifyCFG.

Done in #127652, but this didn't work out because LICM still ends up hoisting the GEPs before SimplifyCFG can sink them.

> I'd like to work on this, but I'm not sure where the best place for the sink part is. I'm also not sure whether transforming phi((gep ptr, idx1), (gep ptr, idx2)) to gep ptr, phi(idx1, idx2) is universally beneficial when nothing gets SROA'd. If it is, maybe we should not put it in SimplifyCFG; otherwise, we should sink them somewhere near switchToLookupTable in SimplifyCFG.

SimplifyCFG already does sinking, but it expects the instructions to be in the predecessor blocks. It won't try to sink instructions from other blocks. You could try allowing that for instructions that aren't used anywhere except by the phi node.

The other way to fix this is to prevent SROA from making the transform. The select/phi unfolding is something of a hack, because it's done unconditionally. SROA also has handling to speculate accesses of selects/phis (see e.g. speculatePHINodeLoads). I believe the gep/select unfolding is only ever actually useful in conjunction with those, so it would make sense to extend them to also support an intermediate gep. This would ensure we only perform the transform when it enables SROA. Doing this is probably fairly tricky though.
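A C++-level analogue (hypothetical illustration, not the actual IR code) of the kind of speculation speculatePHINodeLoads performs: instead of loading through a merged pointer, the loads are hoisted into each predecessor and the loaded values are merged, so no pointer phi survives for SROA to worry about:

```cpp
// Before speculation: a load through a phi'd pointer.
int load_through_phi(bool c, const int* p1, const int* p2) {
    const int* p = c ? p1 : p2; // phi(p1, p2)
    return *p;                  // load of the phi
}

// After speculation: one load per predecessor, values merged instead.
int speculated_loads(bool c, const int* p1, const int* p2) {
    int v1 = *p1;       // load in predecessor 1
    int v2 = *p2;       // load in predecessor 2
    return c ? v1 : v2; // phi(v1, v2)
}
```

Note that speculating both loads is only valid when both pointers are known to be safely dereferenceable, which is part of what makes extending this handling to an intermediate gep tricky.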

@sharkautarch

@nikic What if you just changed the currently-unconditional select/phi unfolding code in SROA to prevent it from unfolding GEPs of phis that have more than some arbitrarily chosen number of (unique) incoming values originating from the same switch instruction? Or prevent it from unfolding GEPs of phis with more than a certain number of incoming values entirely.
This might not be the best long-term solution, but maybe (short of reverting PR #83494 entirely) it's the easiest (or most straightforward) short-term fix?
