Don't use load/store intrinsics #185
base: main
Conversation
|
This is a massive improvement on Apple M4! This takes |
|
Here's what I get with vello_cpu main from December vs. vello_cpu with this PR and #171: it seems to yield improvements in some cases, but unfortunately still regressions in others. :( But if it alleviates problems in other benchmarks, we can still merge it.
|
So you mean basically reverting #184? Seems to be about the same, unfortunately.
|
So just for the record. Before (using fearless_simd 0.3):
impl<S: Simd> Iterator for GradientPainter<'_, S> {
type Item = u8x64<S>;
#[inline(always)]
fn next(&mut self) -> Option<Self::Item> {
let extend = self.gradient.extend;
let pos = f32x16::from_slice(self.simd, self.t_vals.next()?);
let t_vals = apply_extend(pos, extend);
let indices = (t_vals * self.scale_factor).to_int::<u32x16<S>>();
let mut vals = [0_u8; 64];
for (val, idx) in vals.chunks_exact_mut(4).zip(*indices) {
val.copy_from_slice(&self.lut[idx as usize]);
}
Some(u8x64::from_slice(self.simd, &vals))
}
}
impl<S: Simd> crate::fine::Painter for GradientPainter<'_, S> {
#[inline(never)]
fn paint_u8(&mut self, buf: &mut [u8]) {
self.simd.vectorize(
#[inline(always)]
|| {
for chunk in buf.chunks_exact_mut(64) {
chunk.copy_from_slice(self.next().unwrap().as_slice());
}
},
);
}
fn paint_f32(&mut self, _: &mut [f32]) {
unimplemented!()
}
}
After (using fearless_simd main + this PR + #171):
impl<S: Simd> crate::fine::Painter for GradientPainter<'_, S> {
#[inline(never)]
fn paint_u8(&mut self, buf: &mut [u8]) {
self.simd.vectorize(
#[inline(always)]
|| {
let src: &[u32] = bytemuck::cast_slice(&self.lut);
let dest: &mut [u32] = bytemuck::cast_slice_mut(buf);
for chunk in dest.chunks_exact_mut(16) {
let extend = self.gradient.extend;
let pos = f32x16::from_slice(self.simd, self.t_vals.next().unwrap());
let t_vals = apply_extend(pos, extend);
let indices = (t_vals * self.scale_factor).to_int::<u32x16<S>>();
indices.gather_into(src, chunk);
}
},
);
}
fn paint_f32(&mut self, _: &mut [f32]) {
unimplemented!()
}
}
Here are the assemblies: https://gist.github.com/LaurenzV/a5ed17df074e7de3eed2b96c41f121d2
For |
|
Is this before/after this PR, or are there some other changes that are also included?
|
If I change it back to not use the new
#[inline(never)]
fn paint_u8(&mut self, buf: &mut [u8]) {
self.simd.vectorize(
#[inline(always)]
|| {
self.simd.vectorize(
#[inline(always)]
|| {
for chunk in buf.chunks_exact_mut(64) {
let extend = self.gradient.extend;
let pos = f32x16::from_slice(self.simd, self.t_vals.next().unwrap());
let t_vals = apply_extend(pos, extend);
let indices = (t_vals * self.scale_factor).to_int::<u32x16<S>>();
for (val, idx) in chunk.chunks_exact_mut(4).zip(*indices) {
val.copy_from_slice(&self.lut[idx as usize]);
}
}
},
);
},
);
}
Before is using fearless_simd 0.3, After is using fearless_simd main + this PR + #171 |
Could you share the assembly for this case? |
Here: |
|
Ah, yeah, it's the bounds-checks-on-gather case again. The compiler just so happens to structure the load loop differently; this seems to be incidental to the use of intrinsics. I am not very good at reading AArch64 assembly, but Gemini has a convincing explanation of what's happening. This PR fixes awful codegen for loading contiguous data into vector types. On main
So on balance I think this is well worth merging: it fixes some really awful codegen for the most common case, and the regression for the fill case seems entirely incidental and will likely change from one LLVM version to another anyway.
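To make the bounds-check point concrete, here is a minimal, self-contained sketch of the scalar gather from the snippets above (the free function and the [[u8; 4]] LUT type are assumptions for illustration, not vello_cpu's actual definitions):

/// Scalar "gather": each `lut[idx as usize]` carries a bounds check against
/// `lut.len()`, so a 64-byte chunk pays 16 checks. How LLVM arranges those
/// checks around the surrounding loop is what differs between the two
/// assemblies, independent of whether load/store intrinsics are used.
fn gather_chunk(lut: &[[u8; 4]], indices: [u32; 16], chunk: &mut [u8]) {
    for (val, idx) in chunk.chunks_exact_mut(4).zip(indices) {
        // Bounds check happens here, once per lane.
        val.copy_from_slice(&lut[idx as usize]);
    }
}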
|
Sure, do the changes with |
|
I've looked at the documentation for
I don't really follow what's happening in the stores, but all the PhastFT tests pass on this branch, so it's certainly not grossly wrong.
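For the store side specifically, here is a rough sketch of the two flavors being compared, assuming (per the PR title) that stores now go through plain memory operations instead of the vst1q-style intrinsics; the helper names are hypothetical:

#[cfg(target_arch = "aarch64")]
mod store_sketch {
    use core::arch::aarch64::{uint8x16_t, vst1q_u8};

    /// Store 16 bytes via the NEON intrinsic, which stdarch maps to an
    /// LLVM intrinsic on AArch64.
    #[inline(always)]
    pub unsafe fn store_via_intrinsic(dst: *mut u8, v: uint8x16_t) {
        unsafe { vst1q_u8(dst, v) }
    }

    /// Store 16 bytes via a raw memory copy; LLVM sees an ordinary
    /// 16-byte memcpy and can merge or reorder it with surrounding code
    /// much earlier in the pipeline.
    #[inline(always)]
    pub unsafe fn store_via_memcpy(dst: *mut u8, v: uint8x16_t) {
        unsafe {
            core::ptr::copy_nonoverlapping(
                &v as *const uint8x16_t as *const u8,
                dst,
                core::mem::size_of::<uint8x16_t>(),
            );
        }
    }
}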
|
@valadaptive if I use
fine/gradient/linear/opaque_u8_neon
time: [473.02 ns 476.11 ns 482.30 ns]
change: [-2.6771% -1.3268% +0.4153%] (p = 0.05 > 0.05)
No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) low mild
3 (3.00%) high severe
Still a bit worse than current main, but I think at this point I can live with the difference!
At the risk of getting into micro-optimization, is there any benefit if you split up the
If it turns out to be beneficial, I could implement it in
I might take a look at LLVM at some point, but it's not really an area of it that I've looked into before.
|
With benchmarks all on par or improved and safety assured, is this good to go? I'm excited to see this merged because it's the last blocker for merging the migration of https://github.com/QuState/PHastFT/ from |
|
I’ll look at it tomorrow |
This is an interesting one! The remaining performance gap in QuState/PhastFT#58 seems to come from subpar performance when loading constants.
I noticed that in Rust's stdarch, which defines all the SIMD intrinsics, the x86 load/store intrinsics lower to raw memory operations (ptr::copy_nonoverlapping). The AArch64 load/store intrinsics, on the other hand, do map to corresponding LLVM intrinsics! My hypothesis is that the LLVM intrinsics are not lowered until much later in the compilation pipeline, resulting in far fewer optimization opportunities and much worse codegen. If that is the case, we should just use memory operations directly. This also simplifies the code that we generate by quite a bit.
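To illustrate the hypothesis, here is a minimal AArch64-only sketch of a 16-byte load done both ways; the function names are hypothetical and this is not fearless_simd's actual generated code, but the second form mirrors what stdarch's x86 load intrinsics do with ptr::copy_nonoverlapping:

#[cfg(target_arch = "aarch64")]
mod load_sketch {
    use core::arch::aarch64::{uint8x16_t, vld1q_u8};
    use core::mem::MaybeUninit;

    /// Load 16 bytes via the NEON intrinsic: stdarch forwards this to an
    /// LLVM intrinsic, which is only lowered late in the pipeline.
    #[inline(always)]
    pub unsafe fn load_via_intrinsic(src: *const u8) -> uint8x16_t {
        unsafe { vld1q_u8(src) }
    }

    /// Load 16 bytes via a raw memory operation: LLVM sees a plain memcpy
    /// into a vector-sized value and can optimize around it much earlier.
    #[inline(always)]
    pub unsafe fn load_via_memcpy(src: *const u8) -> uint8x16_t {
        unsafe {
            let mut out = MaybeUninit::<uint8x16_t>::uninit();
            core::ptr::copy_nonoverlapping(
                src,
                out.as_mut_ptr() as *mut u8,
                core::mem::size_of::<uint8x16_t>(),
            );
            out.assume_init()
        }
    }
}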