The fundamental branchless swap_if produces suboptimal code on x86-64. I ported it to Rust and noticed that changing it yielded a 50% performance uplift for that function on Zen3. This will of course depend on the hardware, but cmov seems to yield better results than the setl/setg-style code that is currently being produced, probably helped by needing 8 instructions instead of 10.
Here is the current version:
And here is the version that produces cmov code:
I think if you can find a way to reliably produce cmov instructions like LLVM does, you should see a noticeable speed improvement.
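To make the difference concrete, here is a minimal Rust sketch of the two styles. It is illustrative only and is not the code from the snippets referenced above: the names `swap_if_index` and `swap_if_select` are hypothetical, and the first function is just one common way to end up with setcc-style code. Whether the second form actually lowers to cmov depends on the compiler version, optimization level, and target, so the generated assembly should be checked rather than assumed.

```rust
/// Index-style conditional swap (hypothetical name): the condition is turned
/// into an integer (0 or 1) and used to pick from a temporary array. Compilers
/// often materialize the flag with a setcc-style instruction for this pattern.
pub fn swap_if_index<T: Copy>(c: bool, a: &mut T, b: &mut T) {
    let tmp = [*a, *b];
    *a = tmp[c as usize];
    *b = tmp[!c as usize];
}

/// Select-style conditional swap (hypothetical name): each output is a plain
/// two-way choice between values already loaded into locals. LLVM can lower
/// such selects to cmov on x86-64, but this is not guaranteed.
pub fn swap_if_select<T: Copy>(c: bool, a: &mut T, b: &mut T) {
    let (x, y) = (*a, *b);
    *a = if c { y } else { x };
    *b = if c { x } else { y };
}
```

Both functions have the same observable behavior, so either can be dropped into a sorting network step such as `swap_if_select(v[1] < v[0], &mut x, &mut y)` and compared with a benchmark and a look at the disassembly.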