
Are non-SIMD fallbacks and automatic multiple-target support planned? #9

Closed · pedrocr opened this issue Jun 5, 2017 · 9 comments

@pedrocr commented Jun 5, 2017

I'm the author of rawloader and am trying to figure out the best way to speed up image operations without a lot of code duplication. Ideally one could write the function once and have the best SIMD implementation be used in several architectures. For this a few things would need to happen:

  1. When SIMD isn't available at all, fall back to an implementation of the same operations
  2. Have the same function call the ideal implementation depending on whether the target has SSE/AVX/etc.
  3. Auto-generate all the target variations (e.g., with and without AVX on x86-64) and dispatch between them at runtime

Having 1) would make it much easier to add SIMD support to applications without having to special-case everywhere, although at least for some applications it's not strictly needed, since you'd never want to run them on a CPU that basic. Having 2) would take a lot of the effort out of writing SIMD implementations, but it only makes sense if the performance downside of mixing SIMD with non-SIMD code doesn't make it slower than the fully non-SIMD version. I've proposed the equivalent of 3) for normal LLVM code generation here:

rust-lang/rust#42432

I'm curious what the general opinion on this is. At least for basic operations like doing a FMA on a bunch of values it would be great to be able to write once and target most architectures efficiently with good fallbacks. Having a way to also use OpenCL with the same code would be even nicer and probably possible for a few simple operations.

@pedrocr changed the title from "Will the simd calls fallback to normal code?" to "Are non-SIMD fallbacks and automatic multiple-target support planned?" on Jun 5, 2017
@BurntSushi (Contributor) commented Jun 5, 2017

Well, if you have two 256-bit vectors and add them together, then LLVM is going to generate what it thinks is the optimal code for the target platform.

> dispatch between them at runtime

This is a completely different beast. It's not clear whether this library will get support for that sort of thing.

It's hard to say much more, though. Could you please offer some concrete examples and describe what you expect to happen?

@pedrocr (Author) commented Jun 5, 2017

I've created a minimal test case for this here:

https://github.com/pedrocr/rustc-math-bench/

Just compiling that with target-cpu=native gives me a ~40% reduction in runtime, so it's well worth it, and I'd like Rust to compile the same code for several targets and decide between them at runtime. That's the issue I opened on rust-lang, and it's applicable to more than just SIMD.

For SIMD, it would be nice if I could write this loop only once:

https://github.com/pedrocr/rustc-math-bench/blob/master/src/main.rs#L41-L43

And then have it be normal adds and muls if nothing is available, SSE2 if that's available, or even AVX if it's available and worth it.
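
(For illustration, the shape of that loop is roughly the following; this is a simplified sketch with placeholder names and element type, not the verbatim benchmark code.)

```rust
// Simplified sketch of the benchmark's inner loop (placeholder names,
// not the verbatim code from rustc-math-bench): a plain multiply-add
// over slices that LLVM may or may not vectorize per target.
fn mul_add(out: &mut [f32], a: &[f32], b: &[f32], c: &[f32]) {
    for i in 0..out.len() {
        out[i] = a[i] * b[i] + c[i];
    }
}
```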

@pedrocr (Author) commented Jun 5, 2017

See here for an example of how matrix multiplication can be sped up by large amounts on the same architecture depending on the SIMD features used:

https://gist.github.com/rygorous/4172889#gistcomment-2114980

The SSE version will always exist on x86-64, but it would be nice if the faster AVX one were used when available, without having to write several versions or compile in different ways.

@BurntSushi (Contributor) commented Jun 5, 2017

> And then have it be normal adds and muls if nothing is available, SSE2 if that's available, or even AVX if it's available and worth it.

Why not just write it with SIMD vectors? LLVM should take care of the code generation for you.

> compiling in different ways

As I said, this is a completely different can of worms. The easy path is to tell the compiler which target you want, and the compiler will take care of the rest. Runtime switching is completely different: it requires compiling every form of the code for each target you want to support into the binary, and then cleverly switching between them at runtime. This is not happening any time soon, and it's not clear at all whether it's in scope for this library. We don't even have the right infrastructure in place for doing it on stable Rust anyway. I'd suggest you go off and build this yourself for now. You can compile individual functions for specific targets using the #[target_feature = "..."] attribute, and there are various CPU-id crates out there that will help you do runtime switching.
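
A minimal sketch of that do-it-yourself route is below. Note that the forms shown (#[target_feature(enable = ...)] and the is_x86_feature_detected! macro) are the ones Rust later stabilized in 1.27; at the time of this thread you would pair the unstable #[target_feature = "..."] attribute with one of those CPU-id crates instead, and all names here are placeholders.

```rust
// Hypothetical runtime-dispatch sketch (placeholder names).
#[cfg(target_arch = "x86_64")]
pub fn scale(out: &mut [f32], a: &[f32], k: f32) {
    if is_x86_feature_detected!("avx2") {
        // Safe: we just verified that this CPU supports AVX2.
        unsafe { scale_avx2(out, a, k) }
    } else {
        scale_fallback(out, a, k)
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn scale_avx2(out: &mut [f32], a: &[f32], k: f32) {
    // Identical loop body; the attribute lets LLVM assume AVX2 when
    // generating code for this one function, independent of the
    // crate-wide target settings.
    for (o, &x) in out.iter_mut().zip(a) {
        *o = x * k;
    }
}

fn scale_fallback(out: &mut [f32], a: &[f32], k: f32) {
    for (o, &x) in out.iter_mut().zip(a) {
        *o = x * k;
    }
}
```

The cost is exactly the duplication described above: every variant gets compiled into the binary, and the dispatch check has to be kept out of hot inner loops.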

@pedrocr (Author) commented Jun 5, 2017

> Why not just write it with SIMD vectors? LLVM should take care of the code generation for you.

I don't know what these are, and googling didn't help much. LLVM auto-vectorization?

> As I said, this is a completely different can of worms. The easy path is to tell the compiler which target you want, and the compiler will take care of the rest.

This I've done and demonstrated a large benefit.

> Runtime switching is completely different: it requires compiling every form of the code for each target you want to support into the binary, and then cleverly switching between them at runtime. This is not happening any time soon, and it's not clear at all whether it's in scope for this library.

OK, too bad. SIMD has a big cost in development time, so it would be nice if there were approaches that perhaps don't extract all the performance but allow SIMD to be used in more cases.

> We don't even have the right infrastructure in place for doing it on stable Rust anyway. I'd suggest you go off and build this yourself for now.

Doing a bunch of ugly manual switching is what I'd like to avoid. But @parched apparently has a macro-based solution for this that could perhaps be used.

@BurntSushi (Contributor) commented Jun 5, 2017

Maybe it isn't clear, but I don't disagree with your stated goals. What I'm trying to tell you is that I don't know what the path forward is. You've brought up several different concerns in this single issue, so it's hard to untangle them and it's not clear whether this specific library should solve all of them. But they should definitely be solved somewhere.

> I don't know what these are, and googling didn't help much.

Have you looked at the documentation for this crate? f32x4 is a SIMD vector, for example, that is roughly equivalent to __m128 in Intel's intrinsic headers. You can see lots of examples using SIMD vectors here: https://github.com/rust-lang-nursery/simd/tree/master/examples

> LLVM auto-vectorization?

Auto vectorization happens when the code generator recognizes a specific pattern and knows it can use SIMD. In theory, that might work for your code. I don't know. You'd need to check the generated assembly. That uncertainty is the primary downside of auto vectorization. This library is focused on explicit SIMD, where you can express operations using SIMD vectors and get a guarantee that the compiler will generate the best SIMD instructions for your target platform.
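
For a concrete taste of the explicit style, here is a minimal sketch using this crate's vector types. It assumes the crate's f32x4::load/store API and slice lengths that are a multiple of four; the examples linked above are the authoritative reference.

```rust
extern crate simd;

use simd::f32x4;

// Multiply-add four lanes at a time using explicit SIMD vectors.
// Assumes out.len() is a multiple of 4 and all slices are equal length.
fn mul_add(out: &mut [f32], a: &[f32], b: &[f32]) {
    let mut i = 0;
    while i + 4 <= out.len() {
        // Ordinary operators on f32x4 compile to SIMD instructions.
        let r = f32x4::load(a, i) * f32x4::load(b, i) + f32x4::load(out, i);
        r.store(out, i);
        i += 4;
    }
}
```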

@pedrocr (Author) commented Jun 5, 2017

Oh, I definitely understood that. I opened this issue to start the conversation and to understand how much of the solution could be in this crate.

> Have you looked at the documentation for this crate? f32x4 is a SIMD vector, for example, that is roughly equivalent to __m128 in Intel's intrinsic headers.

I had seen that, but hadn't realized I could just use those types with normal operations. I went looking for a multiply-add call instead. So that's cool; now I know how I'd implement my code with this crate. :)

> Auto vectorization happens when the code generator recognizes a specific pattern and knows it can use SIMD. In theory, that might work for your code. I don't know. You'd need to check the generated assembly. That uncertainty is the primary downside of auto vectorization.

Yep, and that's why I'd like to use explicit SIMD for some hot parts.

> This library is focused on explicit SIMD, where you can express operations using SIMD vectors and get a guarantee that the compiler will generate the best SIMD instructions for your target platform.

Would it be in the scope of this library to have the f32x4 type have a no-SIMD fallback? Because with that, a reasonable plan for SIMD speedups in Rust code without loss of portability could be:

  1. Enable the annotation for the compiler to compile that function for several platforms and dispatch dynamically at runtime
  2. For select operations, use the simd crate vector types so those operations are explicitly SIMD on all architectures that have it (no effect on others)
  3. For select operations, write architecture-specific code for larger speedups when it's worth it

I'd say this crate could enable 2) and 3) and allow the compiler to do 1) on its code as well.

@BurntSushi (Contributor) commented Jun 5, 2017

> Would it be in the scope of this library to have the f32x4 type have a no-SIMD fallback?

It already does and it's handled seamlessly by the compiler. Otherwise, I'm not really sure what you're asking for. In the general case, you need a code generator to do this, which I think is definitely out of scope for this library anyway.

@pedrocr (Author) commented Jun 5, 2017

Then that's all really :) I've used the crate to implement explicit SIMD in my benchmark:

pedrocr/rustc-math-bench@f81e57c

Seems like LLVM was already pretty good in this case, but not perfect:

| Compilation | normal code | simd code |
| --- | --- | --- |
| -O3 | 16.25 | 14.70 (-10%) |
| -O3 -C target-cpu=native | 7.06 | 6.31 (-11%) |

The generated code looks quite good too, which is very nice. I'll be using this as soon as it works on stable.

@pedrocr closed this as completed Jun 5, 2017