@grebmeg
Collaborator

@grebmeg grebmeg commented Dec 18, 2025

By tracking image opacity, we can skip alpha blending for fully opaque image fills and directly overwrite blend_buf (code here), resulting in a notable performance gain over the main branch (this change adds roughly 24%):

  • Low quality: [image]
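A minimal sketch of the opaque fast path described above, assuming premultiplied RGBA8 pixels. The `ImageFill` type, the `is_opaque` flag and the `fill_span` function are hypothetical names for illustration and are not taken from the PR; only `blend_buf` appears in the description.

```rust
/// Premultiplied RGBA8 pixel (assumed layout for this sketch).
type Rgba8 = [u8; 4];

/// Hypothetical image paint that records whether every pixel has alpha == 255.
struct ImageFill {
    pixels: Vec<Rgba8>,
    is_opaque: bool,
}

impl ImageFill {
    fn new(pixels: Vec<Rgba8>) -> Self {
        // Compute opacity once up front instead of per pixel while blending.
        let is_opaque = pixels.iter().all(|p| p[3] == 255);
        Self { pixels, is_opaque }
    }
}

/// Write `image` into `blend_buf`, skipping the blend math when it is opaque.
fn fill_span(blend_buf: &mut [Rgba8], image: &ImageFill) {
    let n = blend_buf.len().min(image.pixels.len());
    if image.is_opaque {
        // Fast path: source-over with alpha == 255 is a plain overwrite.
        blend_buf[..n].copy_from_slice(&image.pixels[..n]);
        return;
    }
    // Slow path: premultiplied source-over, dst = src + dst * (255 - a) / 255.
    for (dst, src) in blend_buf[..n].iter_mut().zip(&image.pixels[..n]) {
        let inv_a = 255 - src[3] as u32;
        for c in 0..4 {
            let scaled = (dst[c] as u32 * inv_a + 127) / 255;
            dst[c] = src[c].saturating_add(scaled as u8);
        }
    }
}
```

For an opaque source both paths produce the same pixels, so the fast path only changes how much work is done, not the output.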

@grebmeg grebmeg force-pushed the gemberg/tests-diff-log-data branch from 52be151 to d9ec9a2 Compare December 22, 2025 01:31
Base automatically changed from gemberg/tests-diff-log-data to main December 22, 2025 02:03
@grebmeg grebmeg changed the base branch from main to gemberg/perf/image-rendering-improvements December 22, 2025 02:33
@grebmeg grebmeg force-pushed the gemberg/perf/image-rendering-improvements branch from 2d4526c to 422ece4 Compare December 22, 2025 02:35
@grebmeg grebmeg force-pushed the gemberg/gemberg/perf/image-rendering-improvements2 branch from 72652c0 to 10b89a6 Compare December 22, 2025 02:45
@grebmeg grebmeg force-pushed the gemberg/perf/image-rendering-improvements branch from 422ece4 to 5e74593 Compare December 29, 2025 23:37
Base automatically changed from gemberg/perf/image-rendering-improvements to main December 29, 2025 23:53
@grebmeg grebmeg force-pushed the gemberg/gemberg/perf/image-rendering-improvements2 branch from 10b89a6 to c398a62 Compare December 30, 2025 00:01
@grebmeg grebmeg force-pushed the gemberg/gemberg/perf/image-rendering-improvements2 branch from c398a62 to facff60 Compare January 5, 2026 00:14
@grebmeg grebmeg requested a review from LaurenzV January 5, 2026 00:58
// Widen to u16, then compute `256 - fx` to ensure fx + fx_inv = 256.
let fx = self.simd.widen_u8x16(fx);
let fy = self.simd.widen_u8x16(fy);
let fx_inv = u16x16::splat(self.simd, 256) - fx;
Collaborator

Are you sure it's correct to use 256 here? I believe the previous 255 was correct.

Just to explain: in fract(y_positions + 0.5) * 256.0, the term fract(y_positions + 0.5) yields a value in the range [0.0, 1.0) (note that 1.0 is exclusive), so the product falls in the range [0.0, 256.0). A value like 255.5 is therefore truncated to 255 when converted to u8. The inverse should then be 0, because the overall value is supposed to be at most 255 (the maximum value of u8), not 256. It's been a while since I wrote this code, but I think that should be right, or am I missing something?
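The weight quantization discussed in this comment can be sketched in scalar form. This is only an illustration: `fract`, `weight_u8` and `inverse_weight` are hypothetical scalar helpers, not the SIMD functions used in the PR, and the float-to-u8 step is assumed to behave like Rust's truncating, saturating `as` cast.

```rust
/// Fractional part of `x`, computed as `x` minus its floor.
fn fract(x: f32) -> f32 {
    x - x.floor()
}

/// Quantize the bilinear weight for sample position `pos`, mirroring
/// `fract(pos + 0.5) * 256.0` followed by a conversion to u8.
/// Rust's `as u8` truncates toward zero and saturates at 255.
fn weight_u8(pos: f32) -> u8 {
    (fract(pos + 0.5) * 256.0) as u8
}

/// Inverse weight when both weights should sum to `total` (255 or 256).
fn inverse_weight(w: u8, total: u16) -> u16 {
    total - w as u16
}
```

Sampling at a pixel center (`pos = 0.5`) gives a weight of 0, so the inverse weight is 255 or 256 depending on which total is chosen. That choice is the point being debated in this thread.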

Collaborator

Let's say for example that we are sampling the exact center of the pixel (0.5, 0.5). In this case, fx ends up being 0, and using your code fx_inv would be 256. If we are now sampling a white pixel, we would then be calculating 256 * 255, which would overflow the u16. So I think it should be right that fx + fx_inv = 255. 🤔

let ip1 = (p00 * fx_inv + p10 * fx) >> 8;
let ip2 = (p01 * fx_inv + p11 * fx) >> 8;
let res = self.simd.narrow_u16x16((ip1 * fy_inv + ip2 * fy) >> 8);
// Add rounding bias before shifting: round(x/256) = floor((x + 128) / 256).
Collaborator

When I wrote this code I was aware that >> 8 is technically not correct, because we really want to divide by 255 and not 256, but I didn't consider it as critical since there is going to be a lot of imprecision anyway from the various calculations that are performed, and doing the shift is much faster.

So, if we really want to fix this, I think the correct approach would be to use the div_255 method instead of bit-shifting. However, it is probably worth checking how much this slows down the benchmarks. What do you think?
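To make the trade-off between shifting and dividing by 255 concrete, here is a scalar sketch. `div_255_floor` is a hypothetical stand-in built from a well-known add-and-shift identity; it is not necessarily how the crate's `div_255` method is implemented.

```rust
/// Exact floor(x / 255) for x <= 255 * 255, using only adds and shifts.
/// Hypothetical helper for illustration only.
fn div_255_floor(x: u16) -> u16 {
    debug_assert!(x <= 255 * 255);
    // x + 1 + (x >> 8) is at most 65280 here, so it cannot overflow u16.
    (x + 1 + (x >> 8)) >> 8
}

/// The cheaper approximation: divide by 256 instead of 255.
fn shift_8(x: u16) -> u16 {
    x >> 8
}
```

For the product of two maximal bytes, `shift_8(255 * 255)` yields 254 while `div_255_floor(255 * 255)` yields 255, which is the small error the shift introduces.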

Collaborator Author

I used 256 because it allows a simple >> 8 shift for division, avoiding the need for div_255().

There’s no overflow risk: the maximum intermediate value is 256 × 255 + 128 = 65,408, which fits within u16. The +128 rounding bias compensates for some precision loss, since round(x / 256) = floor((x + 128) / 256).

You’re right that this introduces a small asymmetry at the edges: as fract() → 1.0, fx caps at 255 because of the conversion to u8, but this should be imperceptible in practice.

I benchmarked your snippet. It behaves correctly and passes the tests, but it introduces a ~2.5% regression. Given that, do you think the more precise but slightly slower approach is preferable, or is the faster, less precise version acceptable here?
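The overflow bound quoted in this comment can be checked with a small scalar sketch. The function names are hypothetical, and the single-axis `lerp_round_256` is a simplification of the SIMD code under review.

```rust
/// One-axis interpolation with weights summing to 256, a +128 rounding
/// bias and a `>> 8` shift, all in u16 as in the variant discussed above.
fn lerp_round_256(p0: u8, p1: u8, fx: u8) -> u8 {
    let fx = fx as u16;
    let fx_inv = 256 - fx;
    // In debug builds Rust panics on integer overflow, so this line
    // doubles as a check that the accumulator stays within u16.
    let acc = p0 as u16 * fx_inv + p1 as u16 * fx + 128;
    (acc >> 8) as u8
}

/// Largest accumulator value over all weights, with both pixels at 255.
fn worst_case_accumulator() -> u32 {
    (0..=255u32)
        .map(|fx| 255 * (256 - fx) + 255 * fx + 128)
        .max()
        .unwrap()
}
```

`worst_case_accumulator()` evaluates to 65,408, below `u16::MAX` (65,535), and `lerp_round_256(255, 255, fx)` stays at 255 for every `fx`.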

Collaborator

Ah true, I got the overflow part wrong, my bad!

@LaurenzV
Collaborator

LaurenzV commented Jan 5, 2026

I think this is what it should look like. What do you think?

let fx = f32_to_u8(element_wise_splat(
    self.simd,
    fract(x_positions + 0.5).madd(255.0, 0.5),
));
let fy = f32_to_u8(element_wise_splat(
    self.simd,
    fract(y_positions + 0.5).madd(255.0, 0.5),
));

let fx = self.simd.widen_u8x16(fx);
let fy = self.simd.widen_u8x16(fy);
let fx_inv = u16x16::splat(self.simd, 255) - fx;
let fy_inv = u16x16::splat(self.simd, 255) - fy;

let x_pos1 = extend_x(x_positions - 0.5);
let x_pos2 = extend_x(x_positions + 0.5);
let y_pos1 = extend_y(y_positions - 0.5);
let y_pos2 = extend_y(y_positions + 0.5);

let p00 = self
    .simd
    .widen_u8x16(sample(self.simd, &self.data, x_pos1, y_pos1));
let p10 = self
    .simd
    .widen_u8x16(sample(self.simd, &self.data, x_pos2, y_pos1));
let p01 = self
    .simd
    .widen_u8x16(sample(self.simd, &self.data, x_pos1, y_pos2));
let p11 = self
    .simd
    .widen_u8x16(sample(self.simd, &self.data, x_pos2, y_pos2));

let ip1 = (p00 * fx_inv + p10 * fx).div_255();
let ip2 = (p01 * fx_inv + p11 * fx).div_255();
let res = self
    .simd
    .narrow_u16x16((ip1 * fy_inv + ip2 * fy).div_255());

I tried it locally and the tests still seem to pass.
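For comparison, a scalar sketch of the three variants discussed in this thread, applied to a single channel of a 2×2 patch `[p00, p10, p01, p11]`. All names are hypothetical, `div_255` is a plain integer division standing in for the crate's method, and the 256 variant assumes the rounding bias is added before each shift.

```rust
/// Stand-in for `div_255` (plain integer division; an assumption).
fn div_255(x: u16) -> u16 {
    x / 255
}

/// Weights sum to 255, divide with `>> 8`, no rounding bias.
fn bilinear_shift_255(p: [u8; 4], fx: u8, fy: u8) -> u8 {
    let (fx, fy) = (fx as u16, fy as u16);
    let (fx_inv, fy_inv) = (255 - fx, 255 - fy);
    let ip1 = (p[0] as u16 * fx_inv + p[1] as u16 * fx) >> 8;
    let ip2 = (p[2] as u16 * fx_inv + p[3] as u16 * fx) >> 8;
    ((ip1 * fy_inv + ip2 * fy) >> 8) as u8
}

/// Weights sum to 256, add 128 before each `>> 8`.
fn bilinear_round_256(p: [u8; 4], fx: u8, fy: u8) -> u8 {
    let (fx, fy) = (fx as u16, fy as u16);
    let (fx_inv, fy_inv) = (256 - fx, 256 - fy);
    let ip1 = (p[0] as u16 * fx_inv + p[1] as u16 * fx + 128) >> 8;
    let ip2 = (p[2] as u16 * fx_inv + p[3] as u16 * fx + 128) >> 8;
    ((ip1 * fy_inv + ip2 * fy + 128) >> 8) as u8
}

/// Weights sum to 255, divide by 255 at each step.
fn bilinear_div_255(p: [u8; 4], fx: u8, fy: u8) -> u8 {
    let (fx, fy) = (fx as u16, fy as u16);
    let (fx_inv, fy_inv) = (255 - fx, 255 - fy);
    let ip1 = div_255(p[0] as u16 * fx_inv + p[1] as u16 * fx);
    let ip2 = div_255(p[2] as u16 * fx_inv + p[3] as u16 * fx);
    div_255(ip1 * fy_inv + ip2 * fy) as u8
}
```

On an all-white patch the first variant returns 253 for every weight pair, while the other two return 255. Both of the newer variants therefore preserve constant regions; they differ in how intermediate values are rounded and in cost.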

@grebmeg grebmeg requested a review from LaurenzV January 5, 2026 23:33
Collaborator

@LaurenzV LaurenzV left a comment

Will leave it up to you if you want to change back to 256, but I think 2.5% is small enough that it's better to just use the more accurate version!

@grebmeg grebmeg force-pushed the gemberg/gemberg/perf/image-rendering-improvements2 branch from d1a62b3 to 0413d9e Compare January 12, 2026 06:48
@grebmeg grebmeg added this pull request to the merge queue Jan 12, 2026
Merged via the queue into main with commit 7fa709f Jan 12, 2026
17 checks passed
@grebmeg grebmeg deleted the gemberg/gemberg/perf/image-rendering-improvements2 branch January 12, 2026 07:08


3 participants