
Conversation

@valadaptive (Contributor)

Depends on #170. The PR stack is getting a bit large.

When looking into gradient rendering in Vello, I noticed that we perform many reads from the gradient color LUT, and each of those reads requires a bounds check.

The indices into this LUT come from vector types, so theoretically, we should be able to elide all the bounds checks like this:

let mut indices: u32x16<S> = [...];
assert!(!self.lut.is_empty());
assert!(self.lut.len() <= u32::MAX as usize);
indices = indices.min((self.lut.len() - 1) as u32);

Unfortunately, the compiler still does not recognize that the indices are guaranteed to be in bounds. To get the bounds checks elided, we have to perform the min operation on each index individually, after converting it to a usize.
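To make the contrast concrete, here's a minimal standalone sketch (the function name and signature are hypothetical, not from the PR) of the per-index pattern that the optimizer does handle:

fn lookup16(lut: &[u8], indices: &[u32; 16]) -> [u8; 16] {
    assert!(!lut.is_empty());
    let max = lut.len() - 1;
    core::array::from_fn(|i| {
        // Clamping at the access site lets the optimizer prove idx <= max < lut.len().
        let idx = (indices[i] as usize).min(max);
        lut[idx]
    })
}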

We can avoid this by introducing a "gather" operation here which first ensures all the indices are valid (panicking if the source slice is empty and clamping out-of-range indices), then performs a series of unchecked accesses in a loop. There's an equivalent "scatter" operation which does the same thing but for writes.

These operations work on arbitrary slice element types, and gather returns an array. The only vector type involved is the one that holds the indices.
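For a sense of the shape of the operation, here is a rough standalone sketch (a simplified scalar version with illustrative names; the PR's real implementation is macro-generated per vector type):

fn gather<T: Copy, const N: usize>(indices: [u32; N], src: &[T]) -> [T; N] {
    assert!(!src.is_empty(), "cannot gather from an empty slice");
    let max = src.len() - 1;
    core::array::from_fn(|i| {
        // Clamp out-of-range indices to the last valid index.
        let idx = (indices[i] as usize).min(max);
        // SAFETY: idx <= max < src.len(), so the access is in bounds.
        unsafe { *src.get_unchecked(idx) }
    })
}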

This PR intentionally does not introduce any operations that gather into or scatter from vector types. That also means no hardware gather/scatter instructions are used right now; the performance benefit comes solely from avoiding bounds checks. I'm not sure whether we should reserve the names gather and scatter for operations that work on vectors.

type Gathered<T> = [T; #len];

#[inline(always)]
fn gather<T: Copy>(self, src: &[T]) -> Self::Gathered<T> {
Collaborator
I'm wondering whether it makes sense to add a debug assertion that no index exceeds the slice length. This way it's still possible to detect bugs instead of silently clamping to the largest index. But then we would have to remove the guarantee from the documentation that indices will always be clamped.
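For illustration, the suggested check might look something like this inside the standalone gather sketch above (a hypothetical fragment, not the PR's code):

debug_assert!(
    indices.iter().all(|&i| (i as usize) < src.len()),
    "gather index out of range"
);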

@valadaptive (Contributor Author)

Before merging, we should figure out what's going on with the Vello use case this was intended for. The theory is that unchecked indexing improves performance (I noticed a ton of bounds checks in the generated ARM assembly). Indeed, using unchecked indexing directly improves performance:

By the way, I tried the inlining you suggested on main again. If I do that, as mentioned, performance drops (from around 470 ms to 520 ms). However, if I replace the LUT check with an unchecked access in the inlined version, it goes down to 460 ms! If I do the same on main (just changing to unchecked), I get the same performance, so I think for some reason the compiler is able to optimize some of the index checks away on current main, but it stops working once you inline the iterator... even though it doesn't make much sense.

But the "gather" version is 525 ms, slower than not using it.

@LaurenzV (Collaborator) left a comment

LGTM, but I think it would be good to have some tests as well, especially for the edge cases.

self.simd.#min_method(self, ((src.len() - 1) as Self::Element).simd_into(self.simd))
};

let inbounds = &*inbounds;
Collaborator

I don't think this one's necessary? Same for the other two functions.

Collaborator

(I'm talking about the let inbounds = &*inbounds;.)

@valadaptive (Contributor Author)

I've added tests for the scatter/gather ops (well, I generated them, but I looked over them and they cover all the edge cases). Some of these are #[should_panic] tests.
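To illustrate the kind of edge cases meant here, written against the standalone gather sketch earlier in this thread rather than the crate's actual generated tests:

#[test]
fn gather_clamps_out_of_range_indices() {
    let src = [10u32, 20, 30];
    // Indices 7 and 100 clamp to the last valid index, 2.
    assert_eq!(gather([0u32, 2, 7, 100], &src), [10, 30, 30, 30]);
}

#[test]
#[should_panic]
fn gather_from_empty_slice_panics() {
    let src: &[u32] = &[];
    let _ = gather([0u32; 4], src);
}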

I've also made some supporting changes to the simd_test macro: it now just copies all input attributes to the generated test functions (I've removed the individual ignore functions since they've been no-ops for a while), and if it detects a #[should_panic] attribute on a test function, it runs the test using the fallback implementation even if the target CPU features aren't supported.

Surprisingly, cargo-nextest can actually run #[should_panic] tests even on targets like wasm32-wasip1 that have panic=abort, since each test is run in its own process.

I also noticed that I forgot to add SimdGather and SimdScatter bounds to the native-width u8s, u16s, and u32s types, so I did that as well.

Finally, I removed the unnecessary &*inbounds. I thought these were necessary for being able to call get_unchecked on SIMD types directly, but I think the Deref impl takes care of that.
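A minimal illustration of the Deref point, with a made-up vector type (not the crate's):

use core::ops::Deref;

struct U32x4([u32; 4]);

impl Deref for U32x4 {
    type Target = [u32];
    fn deref(&self) -> &[u32] {
        &self.0
    }
}

fn first(v: &U32x4) -> u32 {
    // Auto-deref finds the slice method; no explicit `&*v` reborrow is needed.
    // SAFETY: a U32x4 always holds four elements, so index 0 is in bounds.
    unsafe { *v.get_unchecked(0) }
}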

@valadaptive requested a review from LaurenzV on December 26, 2025 at 12:44.
@valadaptive (Contributor Author)

Before merging this, I want to get to the bottom of what's going on in vello_cpu.

The purpose of these operations was to speed up GradientPainter's paint_u8 method, which currently does a ton of bounds-checked lookups into a gradient LUT.

The experimental "use latest fearless_simd" Vello branch is slower due to some unexplained code generation differences. The fine/gradient/linear/opaque_u8_neon benchmark is 490 ns on main, but 518 ns with the latest fearless_simd.

Changing paint_u8 to use unchecked lookups seems to speed both up to 460 ns. However, using the gather operation instead makes things even slower (525 ns). This is surprising, since it should be doing the exact same thing under the hood.

As such, before merging this, I think we should figure out if something is wrong with the generated code, and try to get that Vello gradient benchmark running at the same speed as it does with unchecked lookups.

// Converting `src.len() - 1` to `Self::Element` will not wrap, because if `src.len() - 1 >=
// Self::Element::MAX`, that means that `src.len() > Self::Element::MAX`, and we take the
// above branch instead.
self.simd.#min_method(self, ((src.len() - 1) as Self::Element).simd_into(self.simd))
Contributor

Running min() on every access seems slow. Can we instead find the maximum index in the vector once, cache it, and only do a single bounds check on every load for the whole vector by comparing SimdGather::max_index() against src.len()? This means SimdGather needs to be a struct instead of a trait but I don't think that's a problem.

I've described such a design in more detail here: okaneco/safe_unaligned_simd#37
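A rough sketch of that design, with hypothetical names and a scalar loop standing in for real vector code:

struct GatherIndices<const N: usize> {
    indices: [u32; N],
    max_index: u32, // cached maximum, computed once at construction
}

impl<const N: usize> GatherIndices<N> {
    fn new(indices: [u32; N]) -> Self {
        let max_index = indices.iter().copied().max().unwrap_or(0);
        Self { indices, max_index }
    }

    fn gather<T: Copy>(&self, src: &[T]) -> [T; N] {
        // A single bounds check covers every lane.
        assert!((self.max_index as usize) < src.len(), "gather index out of range");
        core::array::from_fn(|i| {
            // SAFETY: indices[i] <= max_index < src.len().
            unsafe { *src.get_unchecked(self.indices[i] as usize) }
        })
    }
}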

@valadaptive (Contributor Author)

It's not quite that bad; we're doing it once per gather instead of once per access.

As I understand it, once you've constructed a GatherIndices, you can use it for multiple gather operations. Do you have a use case in mind for this? In Vello, for example, I'd expect the indices to vary while the lookup table remains constant; you're describing an optimization that helps when the lookup table varies but the indices remain constant.

Contributor

I think your assessment is correct. I wanted to use it to load pixels from an in-memory RGBRGBRGBRGB representation into vectors RRRR, GGGG, BBBB.

This was when I was working on vstroebel/jpeg-encoder#17 or vstroebel/jpeg-encoder#18; I can't remember which one. But that crate doesn't use fearless_simd and doesn't have plans to, so this is pretty far down on my priority list.
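For concreteness, that channel-deinterleave pattern could look like this with the standalone gather sketch from earlier (an illustrative fragment only): the strided indices stay fixed while the source slice varies.

let rgb: [u8; 12] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]; // four packed RGB pixels
let r = gather([0u32, 3, 6, 9], &rgb);  // [1, 4, 7, 10]
let g = gather([1u32, 4, 7, 10], &rgb); // [2, 5, 8, 11]
let b = gather([2u32, 5, 8, 11], &rgb); // [3, 6, 9, 12]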

@Shnatsel (Contributor) Jan 21, 2026

Also, as discussed in okaneco/safe_unaligned_simd#37, scatter/gather instructions are a performance minefield: LLVM will sometimes transform scalar loads into a gather and tank performance that way.
