Description
Hi team,
I'm working with the rust-lang team on a concern that would overlap with cpu_features, and am specifically here because this project has been around for a long time, is well regarded, and looks to be very mature and well-designed!
My question is fairly simple - have you run into any issues due to a combination of inlined functions and code motion? Are there any steps the project itself takes or any guarantees that make you feel sufficiently confident that compiler optimizations can't cause instructions to be reordered out of the confines of the protective if statement?
For example, if you had the following contrived code sample:
#include <immintrin.h>  // _bzhi_u32
#include <stdio.h>      // printf
#include "cpuinfo_x86.h"

static const X86Features features = GetX86Info().features;

void Compute(int m, int n) {
    int x, y;
    if (features.bmi2) {
        x = _bzhi_u32(m, n);
        y = x + 1;
    } else {
        y = m + n;
    }
    printf("%d\n", y);
}
Is there anything to prevent a (smart but not too smart) optimizing compiler from noticing that 1) _bzhi_u32() is a known, inlinable function with zero observable side effects, 2) x may be calculated without affecting the else branch, 3) features.bmi2 is always tested, and changing the code to
void Compute(int m, int n) {
    ...
    bzhi x, m, n;
    incr x;
    add y, m, n;
    test features.bmi2;
    cmov y, x;
    ...
}
(or just keeping the jmp but calculating bzhi beforehand)?
The project README doesn't mention that the burden is on a developer to make sure any code using the detected features in the if (features.foo) { ... } branch cannot be inlined to protect against code motion, so I'm wondering if there are any sections of the C or C++ specification or the GCC/LLVM internal APIs that document, e.g., that code motion cannot happen across a non-inlined condition, or anything else that would prevent a (compliant) compiler from generating code that possibly uses an unsupported instruction.
Activity
toor1245 commented on Aug 9, 2022
Hi @mqudsi, thanks for the issue! I think we should consider this case and add an appropriate approach to the documentation.
I have finished investigating this issue. First of all, the behavior depends on the compiler and its version, so I don't think it is possible to account for every compiler optimization. You can probably turn off optimization for specific code:
https://docs.microsoft.com/en-us/cpp/preprocessor/optimize?view=msvc-170
https://stackoverflow.com/questions/34902857/clang-do-not-optimize-a-specific-function
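As a rough illustration of the two links above, the sketch below wraps the per-compiler mechanisms in one macro. The macro name NO_OPT and the function zero_high_bits_unoptimized are hypothetical and not part of cpu_features, and the body is a portable stand-in for the intrinsic so the sketch builds anywhere. Disabling optimization and inlining for a function is only a mitigation; it is not a documented guarantee against code motion in the caller.

```c
#include <stdint.h>

/* Hypothetical helper macro (not part of cpu_features): ask the compiler not
   to optimize or inline the annotated function. Clang is checked first
   because it also defines __GNUC__. */
#if defined(__clang__)
#define NO_OPT __attribute__((optnone, noinline))
#elif defined(__GNUC__)
#define NO_OPT __attribute__((optimize("O0"), noinline))
#elif defined(_MSC_VER)
#define NO_OPT __declspec(noinline)
#else
#define NO_OPT
#endif

/* MSVC disables optimization with a pragma placed outside the function. */
#if defined(_MSC_VER) && !defined(__clang__)
#pragma optimize("", off)
#endif

/* Portable stand-in for _bzhi_u32(m, n): zero the bits of m from bit index
   n upward. Real code would put the BMI2 intrinsic here instead. */
NO_OPT uint32_t zero_high_bits_unoptimized(uint32_t m, uint32_t n) {
  uint32_t idx = n & 0xFFu;          /* BZHI only reads the low 8 bits of n */
  if (idx >= 32u) {
    return m;                        /* index past the top bit: unchanged */
  }
  return m & ((UINT32_C(1) << idx) - 1u);
}

#if defined(_MSC_VER) && !defined(__clang__)
#pragma optimize("", on)
#endif
```

The caller would still guard the call with if (features.bmi2), and only the guarded function contains the feature-specific instruction.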
Theoretical solutions
Detection features via another process
I found this approach in a mimalloc commit: macros are defined at build time from values computed by a separate process, and preprocessor directives then select the appropriate code:
https://github.com/jserv/mimalloc/blob/detect-cache-line-size/CMakeLists.txt#L206-L228
The advantage is that all of the logic is resolved at compile time.
The disadvantage of this approach is that you need to write an additional program to produce OUTPUT_VARIABLE.
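A minimal sketch of how the consuming code could look, assuming the build system passes a macro named HAVE_BMI2 (a hypothetical name, e.g. -DHAVE_BMI2=1 derived from the helper program's output). The function name zero_high_bits is also an assumption made for illustration.

```c
#include <stdint.h>

#if defined(HAVE_BMI2)
/* Hypothetical macro set by the build system. With GCC or Clang the file
   must also be compiled with BMI2 enabled (e.g. -mbmi2) for the intrinsic. */
#include <immintrin.h>
#endif

/* Zero the bits of m from bit index n upward. The code path is chosen by
   the preprocessor, so there is no runtime branch to reorder around. */
static uint32_t zero_high_bits(uint32_t m, uint32_t n) {
#if defined(HAVE_BMI2)
  return _bzhi_u32(m, n);
#else
  uint32_t idx = n & 0xFFu;          /* match BZHI: only the low 8 bits of n */
  if (idx >= 32u) {
    return m;
  }
  return m & ((UINT32_C(1) << idx) - 1u);
#endif
}
```

The trade-off is that the resulting binary is tied to the features of the machine the helper program ran on.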
Allocate functions at runtime
I found the next solution in the IKVM.NET blog post "How to Hack Your Own JIT Intrinsic" and in https://github.com/damageboy/ReJit
It requires writing asm files with and without BMI2 support, following this template: compute_with_bmi2.asm, compute_default.asm, HackJit.cs
Also, you can do it for C/C++
https://stackoverflow.com/questions/10456245/using-c-with-assembly-to-allocate-and-create-new-functions-at-runtime
I guess there is some library for getting opcodes in C/C++.
The disadvantage of this approach is that you need to write code for a specific architecture
GNU Lightning
Also, you can generate code at runtime via GNU lightning, see:
https://www.gnu.org/software/lightning
https://www.gnu.org/software/lightning/manual/html_node/Fibonacci.html#Fibonacci
AsmJit
https://asmjit.com/
https://github.com/asmjit/asmjit
cc: @gchatelet, @Mizux
mqudsi commented on Aug 12, 2022
@toor1245 that's some extensive research! Thanks for taking the time to compile it all into one post.
I think it's important to distinguish between workarounds that cause the application to effectively hardcode its support for a CPU instruction vs those that allow runtime differentiation (which, as I understand it, is the raison d'être for the cpu_features library in the first place). To that end, I think the first approach is out (and it's already possible if you just use the compiler-provided defines like __BMI2__ or __RDRND__ that are set when you use -march=... or -mbmi2, although those aren't necessary if you are writing your own assembly).
I might be wrong, but there is probably a generic alternative to prevent inlining and code motion here available by simply changing the shape of the feature detection (and making sure the cpu_features library's own code is in a separate compilation unit and isn't inlined and optimized against with LTO or similar).
e.g. instead of using if (features.bmi2) { .. } the library could expose (C++ only, though?) an alternative API that would be more resistant to code motion, taking a callback/closure that is conditionally evaluated if the feature is present.
With an interface along those lines, the opportunities for the compiler to reorder code across the conditional are greatly restricted (again, I'm not sure if they're eliminated altogether depending on how smart the compiler is and whether or not the cpu_features library is optimized with the rest of the code as one unit).
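A callback-shaped interface of the kind described above might look like the following C sketch, where a function pointer plays the role of the closure. Everything here is hypothetical: run_if_bmi2, g_has_bmi2, and struct compute_args are illustrative names and not part of cpu_features, and the callback emulates BZHI portably where real code would use _bzhi_u32.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(_MSC_VER) && !defined(__clang__)
#define NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

typedef void (*feature_fn)(void *ctx);

/* Hypothetical stand-in for the library's runtime detection result. */
static bool g_has_bmi2 = false;

/* Hypothetical API: call fn(ctx) only when the feature is present and report
   whether it ran. In a real library this would live in a separate
   compilation unit so the caller cannot see through the check. */
NOINLINE bool run_if_bmi2(feature_fn fn, void *ctx) {
  if (!g_has_bmi2) {
    return false;
  }
  fn(ctx);
  return true;
}

struct compute_args {
  uint32_t m, n, y;
};

/* Feature-specific path; real code would call _bzhi_u32(a->m, a->n) here. */
static void compute_bmi2(void *ctx) {
  struct compute_args *a = (struct compute_args *)ctx;
  uint32_t idx = a->n & 0xFFu;
  uint32_t x = (idx >= 32u) ? a->m : (a->m & ((UINT32_C(1) << idx) - 1u));
  a->y = x + 1u;
}

/* Fallback path, mirroring the else branch of the Compute example. */
static void compute_default(void *ctx) {
  struct compute_args *a = (struct compute_args *)ctx;
  a->y = a->m + a->n;
}

uint32_t compute(uint32_t m, uint32_t n) {
  struct compute_args a = {m, n, 0u};
  if (!run_if_bmi2(compute_bmi2, &a)) {
    compute_default(&a);
  }
  return a.y;
}
```

Because the feature-specific instruction only appears inside the callback, the compiler would have to inline through the indirect call before it could hoist that instruction above the check. Whether that can still happen under LTO is the open question raised above.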
toor1245 commented on Aug 14, 2022
@mqudsi, I started comparing the assembly generated for your Compute function example across the compilers supported by cpu_features (clang, msvc, gcc - latest versions) but can't reproduce this case with code motion.
x86-64 gcc 12.1 flags: -march="skylake" -Ofast
x86-64 clang 14.0.0 flags: -march="skylake" -Ofast
ref: https://gcc.godbolt.org/z/6WrY7GMM6 (gcc)
ref: https://gcc.godbolt.org/z/xoP3qnbPh (clang)
x86-64 msvc v19.latest flags: /O2
ref: https://gcc.godbolt.org/z/oEG5xh4cj (msvc)
I have tested your proposed example with callback:
ref: https://gcc.godbolt.org/z/YWTsWME6a (gcc)
ref: https://gcc.godbolt.org/z/W3PrMYn1c (msvc)
So we can probably introduce something like that, but we need to make sure that everything works in real cases. Could you provide the compiler version and flags with which you tested the Compute function, or an example of code that produces code motion, for further research, please?