-
Notifications
You must be signed in to change notification settings - Fork 277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NEON intrinsics are broken on big-endian #1484
Comments
stdarch no longer provide SIMD on big-endian ARM due to rust-lang/stdarch#1484
stdarch no longer provide SIMD on big-endian ARM due to rust-lang/stdarch#1484
What exactly does this mean? Is there a bug in LLVM? If so, where is it tracked? Or is the problem that Rust stdarch wants to expose the intrinsics the way they work on hardware, but LLVM doesn't provide those semantics? If so, could that be fixed by doing appropriate translation of indices before calling the intrinsics? |
The short answer is that, on big-endian, LLVM portable vectors have a different element ordering than the one in the vector types used by the NEON intrinsics. The C intrinsics work around this by reversing the element ordering in vectors before & after each intrinsic. We need to do the same in stdarch. |
Oh I see, so this is a mismatch about the OTOH this is good news for portable-simd, seems like there we'll be getting consistent behavior across platforms without extra work then. |
stdarch no longer provide SIMD on big-endian ARM due to rust-lang/stdarch#1484
stdarch no longer provide SIMD on big-endian ARM due to rust-lang/stdarch#1484
stdarch no longer provide SIMD on big-endian ARM due to rust-lang/stdarch#1484
stdarch no longer provide SIMD on big-endian ARM due to rust-lang/stdarch#1484
stdarch no longer provide SIMD on big-endian ARM due to rust-lang/stdarch#1484
stdarch no longer provide SIMD on big-endian ARM due to rust-lang/stdarch#1484
As noted in rust-lang/stdarch#1484, the NEON intrinsics are broken on big-endian aarch64. This is part of fixing rust to build for & on big-endian aarch64, following up rust-lang/rust#129819.
Neon / SIMD is known to be problematical in rust, ref. rust-lang/stdarch#1484, even though the CPU itself supports it.
This is done by avoiding attempts at using neon / SIMD in big-endian mode by patching some of the vendored crates. Neon / SIMD is known to be problematical in rust, ref. rust-lang/stdarch#1484, even though the CPU itself supports it. I've also tried reporting the memchr fixes upstream, ref. BurntSushi/memchr#162 So far not yet adopted. Zerocopy has also received a pull request: google/zerocopy#1795
Do this by avoiding trying to use neon / SIMD on big-endian aarch64. Neon intrinsics are problematical on big-endian targets, ref. rust-lang/stdarch#1484
The actual place that intrinsics themselves are handled is in two places: if it's an architecture-specific intrinsic, it uses |
As stated elsewhere, I reported the issue rust-lang/rust#129819 earlier, and now have workarounds in place so that I'm able to produce a working rust compiler on big-endian NetBSD/aarch64 by avoiding attempts to use the NEON extensions in that mode. However, since those extensions are available in the CPU, a better solution would be to fix this issue and then probably to revert the workarounds. Since I'm a relative rust newbie, I have been thinking about what it would take to get some forward motion on that underlying issue. I have so far come to the conclusion that it would be helpful to have a test program (in rust, of course), which excercises / validates all the NEON SIMD extensions, runnable on 64-bit little-endian aarch64 system. Perhaps such a program already exists, and it's just a matter of pointing to it? My newbie status would make it difficult for me to come up with such a program, and I am hoping that it would be helpful in exploring a fix to this underlying issue. Since I do this in my copious spare time (as the expression goes), I can't make any firm commitments, but I think this will be a useful starting point for anyone wanting to tackle this issue properly. The second worry I have is whether adding swizzling / byte-swapping of arguments and results before/after using NEON intrinsics will tend to negate the gains otherwise achieved by the NEON extensions compared to little-endian mode. I don't have a good intuition for that -- anyone have a better suggestion for that? Ideally, the test program could also do some measurement / validation of that? (Or would that be asking too much? It would not be required to act as the initial stepping stone, at least.) |
I assume LLVM has to insert its own byte swapping when lowering the portable SIMD operations on big-endian NEON... so hopefully the codegen backend is good enough to realize that the two swaps cancel each other out, and remove both of them? If not, that seems worth reporting as an LLVM bug.
The stdarch test suite should be able to serve that purpose. The tricky part probably is that we don't have a way to run it on CI. Miri can run some of it (when it only needs generic SIMD intrinsics), not sure if that is good enough to gain confidence for re-landing the intrinsics. |
I also found this gem in the LLVM docs mentioned above:
Is that something the frontend has to do? That would mean we need a special case in our ABI handling code to use |
...whaaa? so a |
That's how I understand this, yes. |
Hmm, then I need to go look there (pointer to directory?), and see if the NEON stuff is easily identifiable / isolateable.
I am "old school", so was thinking foremost of doing the development without any CI support. Getting CI in place for this would then be a separate issue to be tackled separately. |
We shouldn't land anything without CI support, but ofc you can develop in whatever order suits you best. :) |
I completely understand, and probably agree, and recall having seen hints which might help in that direction. We'll see. First things first. |
The easiest way to support big-endian would be to migrate all NEON intrinsics to use the stdarch-gen code generation framework and then have that automatically insert the needed swizzles on big-endian. However this is a huge amount of work and not a trivial undertaking. Another approach would be to adapt the new code generator used for SVE intrinsics in #1509 to also generate NEON intriniscs, which may be easier to work with than the current code generator. cc @JamieCunliffe |
Then I do not understand. As I understand it, doing swaps both pre- and post-SIMD operations are a necessary part of making the SIMD operations work as intended. They are therefore not "cancellable".
Sadly, this fails the "here is a concrete set of operations to test & validate" test. It's like saying to me "it is in there, somewhere, in the rust compiler sources -- you go figure out where and what by yourself". |
Yeah the test suite is complicated I am afraid -- I was just giving you some pointers that hopefully lead into the right direction. That's all I can offer, sorry. I don't know how the stdarch test suite is set up. I'm afraid I don't think there is such a thing as a simple test program; stdarch offers many thousand operations and has some fancy setup to test them all. |
You can look at the |
OK, I understand. I have found at least some of what needs to be looked at, and I'm trying to follow the various suggestions here. Next will need to be some experimentation etc. We'll see how that goes. Thanks anyway! |
Did something recently happen in this space? Zerocopy's nightly CI started failing this past weekend due to SIMD intrinsic name resolution errors on |
@jswrenn That is because rust-lang/rust#132714 bumped memchr without considering aarch64_be is not supported in the latest memchr (BurntSushi/memchr#162). (According to the author of that rust-lang/rust PR, the memchr bump might be able to be reverted: taiki-e/atomic-maybe-uninit@6ea62cc#commitcomment-148891566) |
FWIW I was going to look at getting this working once the new ARM generator PR has gone in |
These are currently broken because the order of elements inside vectors is reversed on big-endian systems: the ARM ABI requires that element 0 is located at the highest address of the vector type. However LLVM intrinsics expect element 0 to be located at the lowest address.
See https://llvm.org/docs/BigEndianNEON.html and
arm_neon.h
in Clang for more details.The text was updated successfully, but these errors were encountered: