Are non-SIMD fallbacks and automatic multiple-target support planned? #9
I'm the author of rawloader and am trying to figure out the best way to speed up image operations without a lot of code duplication. Ideally one could write a function once and have the best SIMD implementation be used on several architectures. For this a few things would need to happen:

1. Non-SIMD fallbacks, so that code written with SIMD types still compiles and runs on CPUs that lack the required instructions.
2. Fallbacks for individual operations, so that a SIMD implementation can mix in non-SIMD code wherever a target lacks a specific instruction.
3. Compiling the same code for multiple targets and switching to the best version at runtime.

Having 1) would make it much easier to add SIMD support to applications without having to add special cases everywhere, although for some applications it's not strictly needed, as you would never want to run them on a CPU that basic. Having 2) would take a lot of the effort out of writing SIMD implementations, but it only makes sense if the performance downside of mixing SIMD with non-SIMD code doesn't make the result slower than the fully non-SIMD version. I've proposed the equivalent of 3) for normal LLVM code generation here: rust-lang/rust#42432

I'm curious what the general opinion on this is. At least for basic operations, like doing an FMA on a bunch of values, it would be great to be able to write the code once and target most architectures efficiently, with good fallbacks. Having a way to also use OpenCL with the same code would be even nicer, and is probably possible for a few simple operations.

Comments
Well, if you have two 256-bit vectors and add them together, then LLVM is going to generate what it thinks is the optimal code based on the target platform.

This is a completely different beast. It's not clear whether this library will get support for that sort of thing. It's hard to say much more, though. Could you please offer some concrete examples and say what you expect to happen?
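(For illustration, a minimal sketch of the first point. It uses the nightly portable-SIMD types from std::simd as a stand-in; the feature gate and names are that API's, not necessarily this crate's.)

```rust
#![feature(portable_simd)] // nightly-only stand-in for the crate's vector types
use std::simd::prelude::*;

// One 256-bit add, written once. LLVM picks the lowering for the target:
// a single vaddps on AVX, two 128-bit addps on SSE-only CPUs, or scalar
// adds on targets with no usable SIMD at all.
fn add(a: f32x8, b: f32x8) -> f32x8 {
    a + b
}

fn main() {
    let a = f32x8::splat(1.0);
    let b = f32x8::from_array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]);
    println!("{:?}", add(a, b));
}
```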
I've created a minimal test case for this here: https://github.com/pedrocr/rustc-math-bench/

For SIMD it would be nice if I could write this loop only once: https://github.com/pedrocr/rustc-math-bench/blob/master/src/main.rs#L41-L43 and then have it use normal adds and muls if nothing else is available, SSE2 if that's available, or even AVX if it's available and worth it.

See here for an example of how matrix multiplication can be sped up by large amounts on the same architecture depending on the SIMD used: https://gist.github.com/rygorous/4172889#gistcomment-2114980 The SSE version will always exist on x86-64, but it would be nice if the faster AVX one were used when available, without having to write several versions or compile in different ways.
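(The linked loop, paraphrased as a sketch; the function and variable names here are hypothetical, not the benchmark's actual code.)

```rust
// A scalar kernel of the shape in question: one multiply and one add per
// element. The hope is to write only this and have it lowered to plain
// muls/adds, SSE2, or AVX depending on what the CPU supports.
fn mul_add(out: &mut [f32], input: &[f32], coeff: f32, offset: f32) {
    for (o, &x) in out.iter_mut().zip(input) {
        *o = x * coeff + offset;
    }
}
```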
Why not just write it with SIMD vectors? LLVM should take care of the code generation for you.
As I said, this is a completely different can of worms. The easy path is to tell the compiler which target you want, and the compiler will take care of the rest. Runtime switching is completely different: it requires compiling every form of the code, for each target you want to support, into the binary, and then cleverly switching between them at runtime. This is not happening any time soon, and it's not clear at all whether it's in scope for this library. We don't even have the right infrastructure in place for doing it on Rust stable anyway. I'd suggest you go off and build this yourself for now. You can compile individual functions for specific targets using the `target_feature` attribute.
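(A sketch of that manual approach: compile one function with extra target features and guard the call with a runtime CPU check. `#[target_feature]` and the `is_x86_feature_detected!` macro are the std mechanisms for this; they were nightly-only at the time of this thread and were stabilized later, in Rust 1.27.)

```rust
// Runtime dispatch between a target-specific build of a function and a
// generic fallback.
fn sum(xs: &[f32]) -> f32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // Sound because we just verified the CPU supports AVX2.
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_generic(xs)
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    // Identical source; only the code generation differs. LLVM may now use
    // 256-bit AVX2 instructions when vectorizing this body.
    xs.iter().sum()
}

fn sum_generic(xs: &[f32]) -> f32 {
    xs.iter().sum()
}
```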
I don't know what these are, and googling didn't help much. Do you mean LLVM auto-vectorization?
This I've done and demonstrated a large benefit.
OK, too bad. SIMD has a big development-time cost, so it would be nice to have approaches that perhaps don't extract all the performance but let SIMD be used in more cases.
Doing a bunch of ugly manual switching is what I'd like to avoid. But @parched apparently has a macro-based solution for this that could perhaps be used.
Maybe it isn't clear, but I don't disagree with your stated goals. What I'm trying to tell you is that I don't know what the path forward is. You've brought up several different concerns in this single issue, so it's hard to untangle them and it's not clear whether this specific library should solve all of them. But they should definitely be solved somewhere.
Have you looked at the documentation for this crate?
Auto-vectorization happens when the code generator recognizes a specific pattern and knows it can use SIMD. In theory, that might work for your code. I don't know. You'd need to check the generated assembly. That uncertainty is the primary downside of auto-vectorization. This library is focused on explicit SIMD, where you can express operations using SIMD vectors and get a guarantee that the compiler will generate the best SIMD instructions for your target platform.
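(For example, a sketch of the loop shape the auto-vectorizer tends to recognize; confirming that it actually vectorized means reading the assembly, e.g. from `rustc -O --emit=asm`.)

```rust
// Unit stride, no early exits, independent iterations: the shape LLVM's
// auto-vectorizer usually handles. Whether it really emitted SIMD can only
// be confirmed in the assembly (look for packed mulps/vmulps, not mulss).
pub fn scale(xs: &mut [f32], k: f32) {
    for x in xs.iter_mut() {
        *x *= k;
    }
}
```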
Oh, I definitely understood that. I opened this issue to start the conversation and to understand how much of the solution could be in this crate.
I had seen that, but hadn't realized I could just use those types with normal operations; I went looking for a multiply-add call instead. So that's cool, now I know how I'd implement my code with this crate (see the sketch below). :)
Yep, and that's why I'd like to use explicit SIMD for some hot parts.
Would any of 1), 2) and 3) be in the scope of this library? I'd say this crate could enable 2) and 3) and allow the compiler to do 1) on its code as well.
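(On the multiply-add point above: both spellings exist, sketched here with the nightly portable-SIMD API as a stand-in. `StdFloat::mul_add` is that API's name for the fused operation; this crate's equivalent, if any, may differ.)

```rust
#![feature(portable_simd)] // nightly stand-in for the crate's vector types
use std::simd::prelude::*;
use std::simd::StdFloat; // supplies mul_add on float vectors

fn axpy(a: f32x4, x: f32x4, y: f32x4) -> f32x4 {
    // Plain operators work directly on the SIMD types...
    let _plain = a * x + y;
    // ...and a fused multiply-add is an explicit method call. With FMA
    // hardware this lowers to one instruction and rounds only once.
    a.mul_add(x, y)
}
```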
It already does, and it's handled seamlessly by the compiler. Otherwise, I'm not really sure what you're asking for. In the general case, you need a code generator to do this, which I think is definitely out of scope for this library anyway.
Then that's all, really. :) I've used the crate to implement explicit SIMD in my benchmark: pedrocr/rustc-math-bench@f81e57c Seems like LLVM was already pretty good in this case, but not perfect.

The code looks quite good too, which is very nice. I'll be using this as soon as it works on stable.