Add (scalarized) safe gather and scatter ops #171
Conversation
Force-pushed from 1e0b052 to 0ab5e31.
```rust
type Gathered<T> = [T; #len];

#[inline(always)]
fn gather<T: Copy>(self, src: &[T]) -> Self::Gathered<T> {
```
I'm wondering whether it makes sense to add a debug assertion that no index exceeds the slice length? That way it's still possible to detect bugs instead of silently clamping to the largest index. But then we would have to remove the guarantee from the documentation that it will always be clamped.
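For illustration, a minimal sketch of what such a debug assertion could look like. The function shape here (`gather_checked`, plain slices instead of the crate's vector types) is hypothetical, not the PR's actual signature:

```rust
// Hypothetical scalarized gather: clamps out-of-bounds indices in
// release builds, but panics in debug builds if any index is out of
// bounds, so bugs are caught instead of silently clamped.
fn gather_checked(indices: &[usize], src: &[f32]) -> Vec<f32> {
    assert!(!src.is_empty(), "cannot gather from an empty slice");
    debug_assert!(
        indices.iter().all(|&i| i < src.len()),
        "gather index out of bounds"
    );
    let max = src.len() - 1;
    // In release builds, `i.min(max)` silently clamps; the
    // `debug_assert!` above only fires with debug assertions enabled.
    indices.iter().map(|&i| src[i.min(max)]).collect()
}

fn main() {
    let src = [10.0_f32, 20.0, 30.0];
    let out = gather_checked(&[2, 0, 1], &src);
    println!("{:?}", out); // [30.0, 10.0, 20.0]
}
```

The trade-off is exactly the one described above: with the `debug_assert!`, clamping can no longer be documented as guaranteed behavior, since debug builds will panic instead.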
Force-pushed from b8d9750 to e4fd481.
Before merging, we should figure out what's going on with the Vello use case this was intended for. The theory is that unchecked indexing improves performance (I noticed a ton of bounds checks in the generated ARM assembly). Indeed, using unchecked indexing directly improves performance. But the "gather" version is 525ms, slower than not using it.
LaurenzV left a comment:
LGTM, but I think it would be good to have some tests as well, especially for the edge cases.
```rust
self.simd.#min_method(self, ((src.len() - 1) as Self::Element).simd_into(self.simd))
};

let inbounds = &*inbounds;
```
I don't think this one's necessary? Same for the other two functions.
(I'm talking about the `let inbounds = &*inbounds;`.)
I've added tests for the scatter/gather ops (well, I generated them, but I looked over them and they cover all the edge cases). Some of these are … I've also made some supporting changes to … Surprisingly, cargo-nextest can actually run … I also noticed that I forgot to add … Finally, I removed the unnecessary `let inbounds = &*inbounds;`.
Force-pushed from ca56c31 to e0927a0.
Before merging this, I want to get to the bottom of what's going on in vello_cpu. The purpose of these operations was to speed up … The experimental "use latest …" … Changing … As such, I think we should figure out if something is wrong with the generated code, and try to get that Vello gradient benchmark running at the same speed as it does with unchecked lookups.
```rust
// Converting `src.len() - 1` to `Self::Element` will not wrap, because if `src.len() - 1 >=
// Self::Element::MAX`, that means that `src.len() > Self::Element::MAX`, and we take the
// above branch instead.
self.simd.#min_method(self, ((src.len() - 1) as Self::Element).simd_into(self.simd))
```
Running `min()` on every access seems slow. Can we instead find the maximum index in the vector once, cache it, and only do a single bounds check per load for the whole vector by comparing `SimdGather::max_index()` against `src.len()`? This means `SimdGather` needs to be a struct instead of a trait, but I don't think that's a problem.
I've described such a design in more detail here: okaneco/safe_unaligned_simd#37
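A rough sketch of the design being proposed, using hypothetical names (`GatherIndices`, plain `Vec<usize>` indices rather than the crate's vector types):

```rust
// Hypothetical sketch: compute the maximum index once at construction,
// then each gather does a single bounds comparison for the whole
// vector of indices instead of clamping every lane.
struct GatherIndices {
    indices: Vec<usize>,
    max_index: usize,
}

impl GatherIndices {
    fn new(indices: Vec<usize>) -> Self {
        // One-time scan; reused for every subsequent gather.
        let max_index = indices.iter().copied().max().unwrap_or(0);
        Self { indices, max_index }
    }

    fn gather(&self, src: &[f32]) -> Vec<f32> {
        // Single bounds check for the whole gather.
        assert!(self.max_index < src.len(), "gather index out of bounds");
        self.indices
            .iter()
            // SAFETY: every index is <= max_index < src.len().
            .map(|&i| unsafe { *src.get_unchecked(i) })
            .collect()
    }
}

fn main() {
    let g = GatherIndices::new(vec![2, 0, 1]);
    println!("{:?}", g.gather(&[10.0_f32, 20.0, 30.0])); // [30.0, 10.0, 20.0]
}
```

Note this trades clamping for a panic when an index is out of bounds, and it pays off when the same indices are reused across multiple gathers.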
It's not quite that bad; we're doing it once per gather instead of once per access.
As I understand it, once you've constructed a GatherIndices, you can use it for multiple gather operations. Do you have a use case in mind for this? In e.g. Vello, I'd expect the indices to vary while the lookup table remains constant. You're describing an optimization that would improve performance when the lookup table varies but the indices remain constant.
I think your assessment is correct. I wanted to use it to load pixels from an in-memory RGBRGBRGBRGB representation into vectors RRRR, GGGG, BBBB.
This was when working on vstroebel/jpeg-encoder#17 or vstroebel/jpeg-encoder#18; I can't remember which one. But that crate doesn't use fearless_simd and doesn't have plans to, so this is pretty far down on my priority list.
Also, as discussed in okaneco/safe_unaligned_simd#37, scatter/gather instructions are a performance minefield. LLVM will sometimes transform scalar loads into gathers and tank performance that way.
Depends on #170. The PR stack is getting a bit large.
When looking into gradient rendering in Vello, I noticed that we often perform many reads from the gradient color LUT. Each of those reads requires a bounds check.
The indices into this LUT come from vector types, so theoretically, we should be able to elide all the bounds checks like this:
Unfortunately, the compiler still does not recognize that the indices are guaranteed to be in-bounds. To elide the bounds checks, we need to perform the `min` operation on every index individually, after converting it to a `usize`. We can avoid this by introducing a "gather" operation here which ensures all the indices are valid (panicking if the source slice is empty and clamping the indices), then doing a bunch of unchecked accesses in a loop. There's an equivalent "scatter" operation which does the same thing but for writes.
These operations work on arbitrary slice types, and `gather` returns an array. The only vector type involved is the one that holds all the indices. This PR intentionally does not introduce any operations that gather into/scatter from vector types. That also means that no hardware gather/scatter instructions are used right now, and the performance benefit comes solely from avoiding bounds checks. I'm not sure if we should be reserving the names `gather` and `scatter` for operations that work on vectors.
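The scalarized approach described above can be sketched roughly like this. The shapes here (`gather4`/`scatter4` over plain slices with fixed-size index arrays) are illustrative, not the crate's actual API:

```rust
// Validate once up front, then loop with unchecked accesses so no
// per-element bounds check survives in the generated code.
fn gather4(indices: [usize; 4], src: &[u8]) -> [u8; 4] {
    assert!(!src.is_empty(), "cannot gather from an empty slice");
    let max = src.len() - 1;
    let mut out = [0u8; 4];
    for (o, &i) in out.iter_mut().zip(indices.iter()) {
        // SAFETY: `i.min(max)` is always a valid index into `src`.
        *o = unsafe { *src.get_unchecked(i.min(max)) };
    }
    out
}

// The write-side equivalent: clamp each index, then store unchecked.
fn scatter4(indices: [usize; 4], values: [u8; 4], dst: &mut [u8]) {
    assert!(!dst.is_empty(), "cannot scatter into an empty slice");
    let max = dst.len() - 1;
    for (&i, &v) in indices.iter().zip(values.iter()) {
        // SAFETY: the clamped index is always a valid index into `dst`.
        unsafe { *dst.get_unchecked_mut(i.min(max)) = v };
    }
}

fn main() {
    let src = [10u8, 20, 30];
    // Index 5 is out of bounds and clamps to the last element (30).
    println!("{:?}", gather4([5, 0, 1, 2], &src)); // [30, 10, 20, 30]

    let mut dst = [0u8; 3];
    scatter4([0, 1, 2, 9], [1, 2, 3, 4], &mut dst);
    println!("{:?}", dst); // [1, 2, 4]
}
```

Since no vector types are involved on the data side, there are no hardware gather/scatter instructions here; the win comes purely from hoisting the bounds checks out of the loop.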