perf(vello_common): track has_opacities to skip alpha blending #1329
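As a rough illustration of the idea in the title, the sketch below records once whether an image contains any non-opaque pixel and uses that flag to choose between a plain copy and a per-pixel blend. The `Image` type, the `fill` function, and the exact meaning given to `has_opacities` here are assumptions made for this example, not the actual vello_common code; pixels are assumed to be premultiplied RGBA8 and both buffers the same length.

```rust
/// Hypothetical RGBA8 image (premultiplied alpha) that remembers whether
/// any of its pixels is not fully opaque.
struct Image {
    pixels: Vec<[u8; 4]>,
    /// Assumed meaning: true if at least one pixel has alpha < 255.
    has_opacities: bool,
}

impl Image {
    fn new(pixels: Vec<[u8; 4]>) -> Self {
        // Computed once at construction, so the check is not repeated per fill.
        let has_opacities = pixels.iter().any(|p| p[3] != 255);
        Self { pixels, has_opacities }
    }
}

/// Writes `image` into `blend_buf` (hypothetical helper; lengths must match).
fn fill(image: &Image, blend_buf: &mut [[u8; 4]]) {
    if !image.has_opacities {
        // Fully opaque source: source-over reduces to a plain overwrite.
        blend_buf.copy_from_slice(&image.pixels);
        return;
    }
    // General case: premultiplied source-over, dst = src + dst * (255 - a) / 255.
    for (dst, src) in blend_buf.iter_mut().zip(&image.pixels) {
        let inv_a = 255 - src[3] as u16;
        for c in 0..4 {
            let blended = src[c] as u16 + dst[c] as u16 * inv_a / 255;
            dst[c] = blended.min(255) as u8;
        }
    }
}
```

The fast path does no per-pixel arithmetic at all, which is where a saving of this kind would be expected to come from.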
    // Widen to u16, then compute `256 - fx` to ensure fx + fx_inv = 256.
    let fx = self.simd.widen_u8x16(fx);
    let fy = self.simd.widen_u8x16(fy);
    let fx_inv = u16x16::splat(self.simd, 256) - fx;
Are you sure it's correct to use 256 here? I believe the previous 255 was correct.
Just to explain, in the expression fract(y_positions + 0.5) * 256.0, the expression fract(y_positions + 0.5) will yield a value in the range [0.0, 1.0) (note that 1.0 is exclusive), and therefore it will be mapped to the range [0.0, 256.0), so if we end up with something like 255.5, it will be clamped to 255 after converting to u8. The inverse should then be 0, because the overall value is supposed to be at most 255 (the maximum value of u8), not 256. It's been a while since I wrote this code, but I think that should be right, or am I missing something?
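To make the range argument concrete, here is a scalar sketch of the weight computation described above. The function names are made up for this example and the real code operates on SIMD vectors; positions are assumed to be non-negative. Rust's `as u8` cast truncates toward zero and saturates at 255.

```rust
/// Scalar model of the `* 256.0` weight quantization (hypothetical helper).
/// `frac` is expected to lie in [0.0, 1.0), so the product lies in [0.0, 256.0).
fn weight_256(frac: f32) -> u8 {
    // `as u8` truncates, so e.g. 255.5 becomes 255.
    (frac * 256.0) as u8
}

/// Weight for a sample position, following `fract(pos + 0.5) * 256.0`.
/// Assumes `pos >= 0.0`; `f32::fract` would be negative for negative inputs.
fn weight_256_from_pos(pos: f32) -> u8 {
    weight_256((pos + 0.5).fract())
}
```

For example, `weight_256(255.5 / 256.0)` is 255, and a sample at the pixel center (`pos = 0.5`) gets weight 0. The inverse weight is then either `255 - w` or `256 - w`, which is the point under discussion.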
Let's say for example that we are sampling the exact center of the pixel (0.5, 0.5). In this case, fx ends up being 0, and using your code fx_inv would be 256. If we are now sampling a white pixel, we would then be calculating 256 * 255, which would overflow the u16. So I think it should be right that fx + fx_inv = 255. 🤔
    let ip1 = (p00 * fx_inv + p10 * fx) >> 8;
    let ip2 = (p01 * fx_inv + p11 * fx) >> 8;
    let res = self.simd.narrow_u16x16((ip1 * fy_inv + ip2 * fy) >> 8);
    // Add rounding bias before shifting: round(x/256) = floor((x + 128) / 256).
When I wrote this code I was aware that >> 8 is technically not correct, because we really want to divide by 255 and not 256, but I didn't consider it as critical since there is going to be a lot of imprecision anyway from the various calculations that are performed, and doing the shift is much faster.
So, if we really want to fix this, I think the correct approach would be to use the div_255 method instead of bit-shifting. However, it is probably worth checking how much this slows down the benchmarks. What do you think?
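For reference, the trade-off between the two divisions can be sketched in scalar code. The helper below is a stand-in written for this example and is not necessarily how the `div_255` method mentioned above is implemented; it uses a well-known shift-and-add identity that is exact for inputs below 255 × 256 = 65280.

```rust
/// Hypothetical scalar stand-in for a `div_255` helper: floor(x / 255)
/// using only shifts and adds. Exact for x < 65280 (= 255 * 256).
fn div_255_floor(x: u16) -> u16 {
    let x = x as u32; // widen so the additions cannot overflow
    ((x + 1 + (x >> 8)) >> 8) as u16
}

/// The cheaper approximation: divide by 256 with a single shift.
fn div_256(x: u16) -> u16 {
    x >> 8
}
```

For the largest product of two u8 values, 255 × 255 = 65025, `div_255_floor` returns 255 while `div_256` returns 254, which is the imprecision the comment refers to.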
I used 256 because it allows a simple >> 8 shift for division, avoiding the need for div_255().
There’s no overflow risk: the maximum intermediate value is 256 × 255 + 128 = 65,408, which fits within u16. The +128 rounding bias compensates for some precision loss, since round(x / 256) = floor((x + 128) / 256).
You’re right that this introduces a small asymmetry at the edges: when fract() → 1.0, fx caps at 255 due to u8 clamping. This should be imperceptible in practice.
I benchmarked your snippet. It behaves correctly and passes the tests, but it introduces a ~2.5% regression. Given that, do you think the more precise but slightly slower approach is preferable, or is the faster, less precise version acceptable here?
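The overflow bound quoted above can be checked by brute force. This is a scalar sketch with made-up function names, assuming weights that sum to 256 and the +128 bias from the diff:

```rust
/// Largest value of `p0 * (256 - w) + p1 * w + 128` over all u8 pixel values
/// and all u8 weights, computed in u32 so the search itself cannot overflow.
fn max_intermediate_256() -> u32 {
    let mut max = 0u32;
    for w in 0..=255u32 {
        for p0 in 0..=255u32 {
            for p1 in 0..=255u32 {
                max = max.max(p0 * (256 - w) + p1 * w + 128);
            }
        }
    }
    max
}

/// Rounded division by 256: round(x / 256) = floor((x + 128) / 256).
/// `x` must be at most 65407 so that adding the bias stays within u16.
fn round_shift(x: u16) -> u16 {
    (x + 128) >> 8
}
```

The maximum is reached for two white pixels, where the weighted sum is 255 × 256 regardless of `w`, and the rounded result of that sum is again 255, so the second interpolation pass sees the same input range.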
Ah true, I got the overflow part wrong, my bad!
I think this is how it should look, what do you think?

    let fx = f32_to_u8(element_wise_splat(
        self.simd,
        fract(x_positions + 0.5).madd(255.0, 0.5),
    ));
    let fy = f32_to_u8(element_wise_splat(
        self.simd,
        fract(y_positions + 0.5).madd(255.0, 0.5),
    ));
    let fx = self.simd.widen_u8x16(fx);
    let fy = self.simd.widen_u8x16(fy);
    let fx_inv = u16x16::splat(self.simd, 255) - fx;
    let fy_inv = u16x16::splat(self.simd, 255) - fy;
    let x_pos1 = extend_x(x_positions - 0.5);
    let x_pos2 = extend_x(x_positions + 0.5);
    let y_pos1 = extend_y(y_positions - 0.5);
    let y_pos2 = extend_y(y_positions + 0.5);
    let p00 = self
        .simd
        .widen_u8x16(sample(self.simd, &self.data, x_pos1, y_pos1));
    let p10 = self
        .simd
        .widen_u8x16(sample(self.simd, &self.data, x_pos2, y_pos1));
    let p01 = self
        .simd
        .widen_u8x16(sample(self.simd, &self.data, x_pos1, y_pos2));
    let p11 = self
        .simd
        .widen_u8x16(sample(self.simd, &self.data, x_pos2, y_pos2));
    let ip1 = (p00 * fx_inv + p10 * fx).div_255();
    let ip2 = (p01 * fx_inv + p11 * fx).div_255();
    let res = self
        .simd
        .narrow_u16x16((ip1 * fy_inv + ip2 * fy).div_255());

I tried it locally and the tests still seem to pass.
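A scalar model of a single channel shows what the more accurate variant buys. The functions below are written for this example only: they use weights that sum to 255 and differ only in whether the result is divided by 256 via a shift or by 255 exactly. They ignore the SIMD types and the sampling helpers from the snippet.

```rust
/// One interpolation step, weights summing to 255, divided by 256 via shift.
fn lerp_shift(p0: u8, p1: u8, w: u8) -> u8 {
    let (p0, p1, w) = (p0 as u16, p1 as u16, w as u16);
    // Maximum intermediate is 255 * 255 = 65025, which fits in u16.
    ((p0 * (255 - w) + p1 * w) >> 8) as u8
}

/// Same step, but with an exact division by 255.
fn lerp_div255(p0: u8, p1: u8, w: u8) -> u8 {
    let (p0, p1, w) = (p0 as u16, p1 as u16, w as u16);
    ((p0 * (255 - w) + p1 * w) / 255) as u8
}

/// Bilinear interpolation of four samples [p00, p10, p01, p11]:
/// two horizontal steps followed by one vertical step.
fn bilinear(lerp: fn(u8, u8, u8) -> u8, p: [u8; 4], fx: u8, fy: u8) -> u8 {
    let ip1 = lerp(p[0], p[1], fx);
    let ip2 = lerp(p[2], p[3], fx);
    lerp(ip1, ip2, fy)
}
```

Interpolating four white samples with `bilinear(lerp_shift, [255; 4], 128, 128)` yields 253, since each shift loses a little brightness, whereas `bilinear(lerp_div255, [255; 4], 128, 128)` yields 255.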
LaurenzV
left a comment
Will leave it up to you if you want to change back to 256, but I think 2.5% is small enough that it's better to just use the more accurate version!
By tracking image opacities, we can skip alpha blending for fully opaque image fills and directly override blend_buf (code here), resulting in a notable performance gain over the main branch (this change adds roughly a 24% improvement):