
@valadaptive
Contributor

This is an interesting one! The remaining performance gap in QuState/PhastFT#58 seems to come from subpar performance when loading constants.

I noticed that in Rust's stdarch, which defines all the SIMD intrinsics, the x86 load/store intrinsics lower to raw memory operations (ptr::copy_nonoverlapping). The AArch64 load/store intrinsics, on the other hand, do map to corresponding LLVM intrinsics!

My hypothesis is that the LLVM intrinsics are not lowered until much later in the compilation pipeline, resulting in far fewer optimization opportunities and much worse codegen. If this is the case, we should just use memory operations directly. This also simplifies the generated code quite a bit.
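
To make the contrast concrete, here is a minimal sketch (not fearless_simd's actual code) of the two lowering strategies for a 128-bit NEON load on an AArch64 target; both function names are hypothetical:

use core::arch::aarch64::{float32x4_t, vld1q_f32};
use core::mem::MaybeUninit;

// Intrinsic-based load: maps to a NEON/LLVM load intrinsic, which is only
// lowered late in the compilation pipeline.
#[inline(always)]
unsafe fn load_via_intrinsic(ptr: *const f32) -> float32x4_t {
    vld1q_f32(ptr)
}

// Plain memory copy: LLVM sees an ordinary 16-byte memcpy that it can fold
// into the surrounding code (including constant loads) much earlier.
#[inline(always)]
unsafe fn load_via_copy(ptr: *const f32) -> float32x4_t {
    let mut out = MaybeUninit::<float32x4_t>::uninit();
    core::ptr::copy_nonoverlapping(ptr.cast::<u8>(), out.as_mut_ptr().cast::<u8>(), 16);
    out.assume_init()
}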

@Shnatsel
Contributor

Shnatsel commented Jan 23, 2026

This is a massive improvement on Apple M4! This takes fearless_simd from ~10% slower than wide to 10-40% faster, with a 30% speedup on most sizes in QuState/PhastFT#58

@LaurenzV
Collaborator

Here's what I get comparing vello_cpu main from December against vello_cpu with this PR and #171. It yields improvements in some cases but unfortunately still regressions in others. :( But if it alleviates problems in other benchmarks, we can still merge it.

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/main.rs (/Users/lstampfl/Programming/GitHub/vello/target/release/deps/main-b47aadb4df1020e4)
fine/fill/opaque_short_u8_neon
                        time:   [7.4267 ns 7.4334 ns 7.4406 ns]
                        change: [-24.443% -22.346% -20.206%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

fine/fill/opaque_long_u8_neon
                        time:   [43.211 ns 43.404 ns 43.646 ns]
                        change: [-35.947% -33.779% -31.660%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

fine/fill/transparent_short_u8_neon
                        time:   [19.305 ns 20.180 ns 21.145 ns]
                        change: [+9.3696% +12.799% +16.477%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  17 (17.00%) high severe

fine/fill/transparent_long_u8_neon
                        time:   [123.57 ns 123.76 ns 123.99 ns]
                        change: [+16.085% +16.371% +16.655%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

fine/strip/solid_short_u8_neon
                        time:   [13.104 ns 13.117 ns 13.133 ns]
                        change: [-6.1164% -5.9475% -5.7633%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  9 (9.00%) high mild
  3 (3.00%) high severe

fine/strip/solid_long_u8_neon
                        time:   [78.589 ns 78.656 ns 78.734 ns]
                        change: [+2.2898% +2.6359% +2.9400%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe

fine/pack/pack_block_u8_neon
                        time:   [62.368 ns 62.406 ns 62.448 ns]
                        change: [-1.8640% -1.6525% -1.4733%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

fine/pack/pack_regular_u8_neon
                        time:   [203.74 ns 203.89 ns 204.06 ns]
                        change: [-1.0042% -0.7988% -0.5463%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe

fine/gradient/linear/opaque_u8_neon
                        time:   [529.87 ns 530.74 ns 531.77 ns]
                        change: [+19.602% +19.883% +20.185%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

fine/gradient/radial/opaque_u8_neon
                        time:   [698.40 ns 698.89 ns 699.45 ns]
                        change: [+13.633% +13.964% +14.262%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

fine/gradient/radial/opaque_conical_u8_neon
                        time:   [849.08 ns 849.79 ns 850.64 ns]
                        change: [+13.747% +14.030% +14.282%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

fine/gradient/sweep/opaque_u8_neon
                        time:   [1.1623 µs 1.1631 µs 1.1640 µs]
                        change: [+8.1183% +8.3233% +8.5196%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  7 (7.00%) high severe

fine/gradient/extend/pad_u8_neon
                        time:   [528.84 ns 529.27 ns 529.76 ns]
                        change: [+20.171% +20.434% +20.690%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

fine/gradient/extend/repeat_u8_neon
                        time:   [657.77 ns 658.34 ns 659.05 ns]
                        change: [+16.606% +16.917% +17.247%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  6 (6.00%) high mild
  10 (10.00%) high severe

fine/gradient/extend/reflect_u8_neon
                        time:   [761.78 ns 763.70 ns 766.09 ns]
                        change: [+18.826% +19.311% +19.824%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

fine/gradient/many_stops_u8_neon
                        time:   [813.23 ns 813.84 ns 814.60 ns]
                        change: [+10.621% +10.903% +11.206%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe

fine/gradient/transparent_u8_neon
                        time:   [680.01 ns 680.46 ns 681.00 ns]
                        change: [+10.975% +11.202% +11.419%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

fine/image/transform/none_u8_neon
                        time:   [486.06 ns 486.43 ns 486.89 ns]
                        change: [-4.2720% -4.0379% -3.8356%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  8 (8.00%) high mild
  5 (5.00%) high severe

fine/image/transform/scale_u8_neon
                        time:   [486.30 ns 487.30 ns 488.38 ns]
                        change: [-4.0233% -3.7471% -3.4756%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  9 (9.00%) high mild
  5 (5.00%) high severe

fine/image/transform/rotate_u8_neon
                        time:   [685.88 ns 686.59 ns 687.42 ns]
                        change: [-0.6023% -0.3584% -0.1247%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe

fine/image/quality/low_u8_neon
                        time:   [484.99 ns 485.39 ns 485.85 ns]
                        change: [-4.4166% -4.1752% -3.9803%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

fine/image/quality/medium_u8_neon
                        time:   [2.9297 µs 2.9323 µs 2.9353 µs]
                        change: [+2.0053% +2.1761% +2.3413%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

fine/image/quality/high_u8_neon
                        time:   [12.502 µs 12.511 µs 12.523 µs]
                        change: [+7.4495% +7.6459% +7.8318%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

fine/image/extend/pad_u8_neon
                        time:   [485.35 ns 486.17 ns 487.14 ns]
                        change: [-4.3114% -4.0895% -3.8692%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

fine/image/extend/repeat_u8_neon
                        time:   [553.19 ns 553.58 ns 554.07 ns]
                        change: [-0.3337% -0.0878% +0.1720%] (p = 0.50 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

fine/image/extend/reflect_u8_neon
                        time:   [827.97 ns 852.81 ns 885.82 ns]
                        change: [+6.1852% +15.152% +25.318%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  3 (3.00%) high mild
  12 (12.00%) high severe

@Shnatsel
Contributor

@LaurenzV could you compare this PR against the commit ccf4763 on main? That's the latest one before #184.

I think it would be valuable to look at the effect of removing the use of intrinsics for loads in isolation, without rolling up all the other changes made since December.

@LaurenzV
Collaborator

So you mean basically reverting #184? Seems to be about the same, unfortunately.

@LaurenzV
Collaborator

LaurenzV commented Jan 23, 2026

So just for the record:

Before (using fearless_simd 0.3):

impl<S: Simd> Iterator for GradientPainter<'_, S> {
    type Item = u8x64<S>;

    #[inline(always)]
    fn next(&mut self) -> Option<Self::Item> {
        let extend = self.gradient.extend;
        let pos = f32x16::from_slice(self.simd, self.t_vals.next()?);
        let t_vals = apply_extend(pos, extend);
        let indices = (t_vals * self.scale_factor).to_int::<u32x16<S>>();

        let mut vals = [0_u8; 64];
        for (val, idx) in vals.chunks_exact_mut(4).zip(*indices) {
            val.copy_from_slice(&self.lut[idx as usize]);
        }

        Some(u8x64::from_slice(self.simd, &vals))
    }
}

impl<S: Simd> crate::fine::Painter for GradientPainter<'_, S> {
    #[inline(never)]
    fn paint_u8(&mut self, buf: &mut [u8]) {
        self.simd.vectorize(
            #[inline(always)]
            || {
                for chunk in buf.chunks_exact_mut(64) {
                    chunk.copy_from_slice(self.next().unwrap().as_slice());
                }
            },
        );
    }

    fn paint_f32(&mut self, _: &mut [f32]) {
        unimplemented!()
    }
}
     Running benches/main.rs (/Users/lstampfl/Programming/GitHub/vello/target/release/deps/main-95c585999c9437b3)
fine/gradient/linear/opaque_u8_neon
                        time:   [441.77 ns 442.81 ns 444.06 ns]
                        change: [-16.527% -16.162% -15.780%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  9 (9.00%) high mild
  6 (6.00%) high severe

After (using fearless_simd main + this PR + #171):

impl<S: Simd> crate::fine::Painter for GradientPainter<'_, S> {
    #[inline(never)]
    fn paint_u8(&mut self, buf: &mut [u8]) {
        self.simd.vectorize(
            #[inline(always)]
            || {
                let src: &[u32] = bytemuck::cast_slice(&self.lut);
                let dest: &mut [u32] = bytemuck::cast_slice_mut(buf);
                
                for chunk in dest.chunks_exact_mut(16) {
                    let extend = self.gradient.extend;
                    let pos = f32x16::from_slice(self.simd, self.t_vals.next().unwrap());
                    let t_vals = apply_extend(pos, extend);
                    let indices = (t_vals * self.scale_factor).to_int::<u32x16<S>>();
                    indices.gather_into(src, chunk);
                }
            },
        );
    }

    fn paint_f32(&mut self, _: &mut [f32]) {
        unimplemented!()
    }
}
     Running benches/main.rs (/Users/lstampfl/Programming/GitHub/vello/target/release/deps/main-b47aadb4df1020e4)
fine/gradient/linear/opaque_u8_neon
                        time:   [529.80 ns 530.41 ns 531.13 ns]
                        change: [+18.652% +19.229% +19.748%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

Here are the assemblies: https://gist.github.com/LaurenzV/a5ed17df074e7de3eed2b96c41f121d2

For the "before" case there are two: one for the Fallback dispatch and one for the Neon one; unfortunately I'm not sure which one is which.

@Shnatsel
Contributor

Is this before/after this PR only, or are there some other changes that are also included?

@LaurenzV
Collaborator

If I change it back to not use the new gather, After is slightly better but still not the same as before:

    #[inline(never)]
    fn paint_u8(&mut self, buf: &mut [u8]) {
        self.simd.vectorize(
            #[inline(always)]
            || {
                self.simd.vectorize(
                    #[inline(always)]
                    || {
                        for chunk in buf.chunks_exact_mut(64) {
                            let extend = self.gradient.extend;
                            let pos = f32x16::from_slice(self.simd, self.t_vals.next().unwrap());
                            let t_vals = apply_extend(pos, extend);
                            let indices = (t_vals * self.scale_factor).to_int::<u32x16<S>>();

                            for (val, idx) in chunk.chunks_exact_mut(4).zip(*indices) {
                                val.copy_from_slice(&self.lut[idx as usize]);
                            }
                        }
                    },
                );
            },
        );
    }
fine/gradient/linear/opaque_u8_neon
                        time:   [501.64 ns 502.35 ns 503.14 ns]
                        change: [-5.4730% -5.2248% -4.9749%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

@LaurenzV
Collaborator

LaurenzV commented Jan 23, 2026

Is this before/after this PR only, or are there some other changes that are also included?

Before is using fearless_simd 0.3, After is using fearless_simd main + this PR + #171

@Shnatsel
Contributor

If I change it back to not use the new gather, After is slightly better but still not the same as before:

Could you share the assembly for this case?

@LaurenzV
Collaborator

LaurenzV commented Jan 23, 2026

Could you share the assembly for this case?

Here:

Working with file: /Users/lstampfl/Programming/GitHub/vello/target/release/deps/vello_cpu-f646e0ded71d97fc.s
        .globl  <vello_cpu::fine::lowp::gradient::GradientPainter<S> as vello_cpu::fine::Painter>::paint_u8
        .p2align        2
<vello_cpu::fine::lowp::gradient::GradientPainter<S> as vello_cpu::fine::Painter>::paint_u8:
Lfunc_begin1:
        .cfi_startproc
        sub sp, sp, #32
        .cfi_def_cfa_offset 32
        stp x29, x30, [sp, #16]
        add x29, sp, #16
        .cfi_def_cfa w29, 16
        .cfi_offset w30, -8
        .cfi_offset w29, -16
        .cfi_remember_state
        tst x2, #0xffffffffffffffc0
        b.eq LBB1_26
        ldr x11, [x0, #96]
        ldp x9, x10, [x0, #64]
        cmp x11, #16
        b.ne LBB1_27
        mov x8, x1
        ldp x1, x12, [x0, #112]
        ldr x11, [x0, #104]
        ldrb w12, [x12, #309]
        ldp q0, q1, [x0]
        ldp q2, q3, [x0, #32]
        and x13, x2, #0xffffffffffffffc0
        neg x13, x13
        add x14, x9, #64
        movi.2d v4, #0000000000000000
        fmov.4s v5, #1.00000000
        fmov.4s v6, #-1.00000000
        movi.4s v7, #63, lsl #24
LBB1_3:
        subs x10, x10, #16
        b.lo LBB1_28
        stp x14, x10, [x0, #64]
        ldp q19, q18, [x14, #-64]
        ldp q17, q16, [x14, #-32]
        cbz w12, LBB1_8
        cmp w12, #1
        b.ne LBB1_7
        frintm.4s v20, v19
        frintm.4s v21, v18
        frintm.4s v22, v17
        frintm.4s v23, v16
        fsub.4s v19, v19, v20
        fsub.4s v18, v18, v21
        fsub.4s v17, v17, v22
        fsub.4s v16, v16, v23
        fcvtzs.4s v20, v19
        scvtf.4s v20, v20
        fsub.4s v19, v19, v20
        fcvtzs.4s v20, v18
        scvtf.4s v20, v20
        fsub.4s v18, v18, v20
        fcvtzs.4s v20, v17
        scvtf.4s v20, v20
        fsub.4s v17, v17, v20
        fcvtzs.4s v20, v16
        scvtf.4s v20, v20
        fsub.4s v16, v16, v20
        b LBB1_9
LBB1_7:
        fadd.4s v19, v19, v6
        fadd.4s v18, v18, v6
        fadd.4s v17, v17, v6
        fadd.4s v16, v16, v6
        fmul.4s v20, v19, v7
        fmul.4s v21, v18, v7
        fmul.4s v22, v17, v7
        fmul.4s v23, v16, v7
        frintm.4s v20, v20
        frintm.4s v21, v21
        frintm.4s v22, v22
        frintm.4s v23, v23
        fadd.4s v20, v20, v20
        fadd.4s v21, v21, v21
        fadd.4s v22, v22, v22
        fadd.4s v23, v23, v23
        fsub.4s v19, v19, v20
        fsub.4s v18, v18, v21
        fsub.4s v17, v17, v22
        fsub.4s v16, v16, v23
        fadd.4s v19, v19, v6
        fadd.4s v18, v18, v6
        fadd.4s v17, v17, v6
        fadd.4s v16, v16, v6
        fabs.4s v19, v19
        fabs.4s v18, v18
        fabs.4s v17, v17
        fabs.4s v16, v16
LBB1_8:
        fmax.4s v19, v19, v4
        fmax.4s v18, v18, v4
        fmax.4s v17, v17, v4
        fmax.4s v16, v16, v4
        fmin.4s v19, v19, v5
        fmin.4s v18, v18, v5
        fmin.4s v17, v17, v5
        fmin.4s v16, v16, v5
LBB1_9:
        fmul.4s v19, v19, v0
        fcvtzu.4s v19, v19
        fmov w9, s19
        cmp x1, x9
        b.ls LBB1_29
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8]
        mov.s w9, v19[1]
        cmp x1, x9
        b.ls LBB1_29
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #4]
        mov.s w9, v19[2]
        cmp x1, x9
        b.ls LBB1_29
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #8]
        mov.s w9, v19[3]
        cmp x1, x9
        b.ls LBB1_29
        fmul.4s v18, v18, v1
        fcvtzu.4s v18, v18
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #12]
        fmov w9, s18
        cmp x1, x9
        b.ls LBB1_29
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #16]
        mov.s w9, v18[1]
        cmp x1, x9
        b.ls LBB1_29
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #20]
        mov.s w9, v18[2]
        cmp x1, x9
        b.ls LBB1_29
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #24]
        mov.s w9, v18[3]
        cmp x1, x9
        b.ls LBB1_29
        fmul.4s v17, v17, v2
        fcvtzu.4s v17, v17
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #28]
        fmov w9, s17
        cmp x1, x9
        b.ls LBB1_29
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #32]
        mov.s w9, v17[1]
        cmp x1, x9
        b.ls LBB1_29
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #36]
        mov.s w9, v17[2]
        cmp x1, x9
        b.ls LBB1_29
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #40]
        mov.s w9, v17[3]
        cmp x1, x9
        b.ls LBB1_29
        fmul.4s v16, v16, v3
        fcvtzu.4s v16, v16
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #44]
        fmov w9, s16
        cmp x1, x9
        b.ls LBB1_29
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #48]
        mov.s w9, v16[1]
        cmp x1, x9
        b.ls LBB1_29
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #52]
        mov.s w9, v16[2]
        cmp x1, x9
        b.ls LBB1_29
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #56]
        mov.s w9, v16[3]
        cmp x1, x9
        b.ls LBB1_29
        ldr w9, [x11, x9, lsl #2]
        str w9, [x8, #60]
        add x14, x14, #64
        add x8, x8, #64
        adds x13, x13, #64
        b.ne LBB1_3
LBB1_26:
        .cfi_def_cfa wsp, 32
        ldp x29, x30, [sp, #16]
        add sp, sp, #32
        .cfi_def_cfa_offset 0
        .cfi_restore w30
        .cfi_restore w29
        ret
LBB1_27:
        .cfi_restore_state
        subs x8, x10, x11
        b.hs LBB1_30
LBB1_28:
Lloh6:
        adrp x0, l_anon.a945abf46221ba9c4f7c92070933bb71.3@PAGE
Lloh7:
        add x0, x0, l_anon.a945abf46221ba9c4f7c92070933bb71.3@PAGEOFF
        bl core::option::unwrap_failed
LBB1_29:
Lloh8:
        adrp x2, l_anon.a945abf46221ba9c4f7c92070933bb71.6@PAGE
Lloh9:
        add x2, x2, l_anon.a945abf46221ba9c4f7c92070933bb71.6@PAGEOFF
        mov x0, x9
        bl core::panicking::panic_bounds_check
LBB1_30:
        add x9, x9, x11, lsl #2
        stp x9, x8, [x0, #64]
Lloh10:
        adrp x0, l_anon.a945abf46221ba9c4f7c92070933bb71.83@PAGE
Lloh11:
        add x0, x0, l_anon.a945abf46221ba9c4f7c92070933bb71.83@PAGEOFF
Lloh12:
        adrp x3, l_anon.a945abf46221ba9c4f7c92070933bb71.84@PAGE
Lloh13:
        add x3, x3, l_anon.a945abf46221ba9c4f7c92070933bb71.84@PAGEOFF
Lloh14:
        adrp x4, l_anon.a945abf46221ba9c4f7c92070933bb71.5@PAGE
Lloh15:
        add x4, x4, l_anon.a945abf46221ba9c4f7c92070933bb71.5@PAGEOFF
        sub x2, x29, #1
        mov w1, #43
        bl core::result::unwrap_failed
        .loh AdrpAdd    Lloh6, Lloh7
        .loh AdrpAdd    Lloh8, Lloh9
        .loh AdrpAdd    Lloh14, Lloh15
        .loh AdrpAdd    Lloh12, Lloh13
        .loh AdrpAdd    Lloh10, Lloh11

@Shnatsel
Contributor

Shnatsel commented Jan 23, 2026

Ah, yeah, it's the bounds-checks-on-gather case again. The compiler just so happens to structure the load loop differently. This seems to be incidental to the use of intrinsics. I am not very good at reading AArch64 assembly, but Gemini has a convincing explanation of what's happening.

This PR fixes awful codegen for loading contiguous data into vector types. On main, f32x16::simd_from(simd, input_slice) causes the emitted code to load data into registers, immediately spill it to the stack, and then load it back into registers again. This happens even for loading constants! vello_cpu clearly benefits from this fix in some parts too.

So on balance I think this is well worth merging: it fixes some really awful codegen for the most common case, and the regression for the fill case seems entirely incidental and will likely change from one LLVM version to another anyway.

@LaurenzV
Collaborator

Sure, do the changes with transmute_copy look sound to you? I'll also take another closer look, but it would be good to have your opinion as well.

@Shnatsel
Contributor

I've looked at the documentation for transmute_copy and its use here seems sound.

I don't really follow what's happening in the stores, all that core::ptr::copy_nonoverlapping business.

PhastFT tests pass on this branch so it's certainly not grossly wrong.
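
For reference, a minimal sketch (using a hypothetical 16-byte vector type, not the PR's actual code) of what a transmute_copy-based load and a copy_nonoverlapping-based store look like:

#[derive(Clone, Copy)]
#[repr(C, align(16))]
struct F32x4([f32; 4]);

#[inline(always)]
fn load_f32x4(src: &[f32; 4]) -> F32x4 {
    // Sound because F32x4 is no larger than [f32; 4]; transmute_copy reads the
    // bytes out of `src` without consuming it and handles alignment itself.
    unsafe { core::mem::transmute_copy(src) }
}

#[inline(always)]
fn store_f32x4(v: F32x4, dst: &mut [f32; 4]) {
    // An ordinary 16-byte copy, which LLVM treats as a plain memcpy.
    unsafe {
        core::ptr::copy_nonoverlapping(v.0.as_ptr(), dst.as_mut_ptr(), 4);
    }
}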

@LaurenzV
Collaborator

@valadaptive if I use gather instead of gather_into, I indeed get a better result:

fine/gradient/linear/opaque_u8_neon
                        time:   [473.02 ns 476.11 ns 482.30 ns]
                        change: [-2.6771% -1.3268% +0.4153%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high severe

Still a bit worse than current main, but I think at this point I can live with the difference!

@valadaptive
Contributor Author

Still a bit worse than current main, but I think at this point I can live with the difference!

At the risk of getting into micro-optimization, is there any benefit if you split up the f32x16 into four blocks of f32x4, then gather+store each sequentially? Instead of doing 16 loads followed by 16 stores, this would result in 4 loads followed by 4 stores, 4 times in a row.

If it turns out to be beneficial, I could implement it in gather_into directly. Although it might take some tuning to figure out the best block size... does it depend on the element count? The native vector width? This is really the sort of thing that LLVM should be handling for us.

I might take a look at LLVM at some point, but it's not really an area of it that I've looked into before.
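
A rough sketch of that blocked approach (hypothetical scalar code over u32 indices and a u32 LUT, ignoring the fearless_simd vector types) might look like:

fn gather_blocked(lut: &[u32], indices: &[u32; 16], dst: &mut [u32; 16]) {
    for (idx, out) in indices.chunks_exact(4).zip(dst.chunks_exact_mut(4)) {
        // 4 loads...
        let mut tmp = [0u32; 4];
        for (t, &i) in tmp.iter_mut().zip(idx) {
            *t = lut[i as usize];
        }
        // ...then 4 stores, before moving on to the next block of lanes.
        out.copy_from_slice(&tmp);
    }
}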

@Shnatsel
Contributor

With benchmarks all on par or improved and safety assured, is this good to go?

I'm excited to see this merged because it's the last blocker for merging the migration of https://github.com/QuState/PHastFT/ from wide to fearless_simd.

@LaurenzV
Collaborator

I’ll look at it tomorrow
