Benchmarking research: deinterlacing rgb(a) buffers is the slow part #192
I think it's amazing that you took the time to go and investigate! thanks. there is a batteries-included storage type that stores the samples for each channel in a separate vector. It uses memcpy to read and write files. Look for …
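conceptually, the layout is something like this rough sketch (hypothetical names, not the actual exrs type):

```rust
// Rough sketch of per-channel ("planar") sample storage; names are
// hypothetical, not the actual exrs type. Because every channel is one
// contiguous vector, whole planes can move to and from the file buffer
// with a single memcpy-style copy each.
struct PlanarRgba {
    width: usize,
    height: usize,
    red: Vec<f32>,
    green: Vec<f32>,
    blue: Vec<f32>,
    alpha: Vec<f32>,
}

impl PlanarRgba {
    fn new(width: usize, height: usize) -> Self {
        let n = width * height;
        Self {
            width,
            height,
            red: vec![0.0; n],
            green: vec![0.0; n],
            blue: vec![0.0; n],
            alpha: vec![1.0; n],
        }
    }
}
```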
is your goal to make …?
also, we are currently working on performing this interleaving on multiple threads.
Ah, great! I didn't know that some work on this had already happened.
At the moment I'm just investigating how fast I can make things. This might become a PR at some point.
Hmm, where does interlacing or deinterlacing happen? The reading path for uncompressed scanlines goes through: Lines 194 to 206 in a3004cc
I did more benchmarking with …
I think something to investigate is why doing a manual deinterlacing step is so much faster than using …
I changed things so that all data is allocated inside the benchmarking closure instead of allowing some of it to be allocated outside, and the results look a lot closer!
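For context, here's roughly what I mean as a minimal criterion sketch — the `deinterlace` function below is a stand-in written for illustration, not the exact code from my branch:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Stand-in for the deinterlacing step under test: split interleaved RGBA
// into planar B, G, R (alphabetical channel order), dropping alpha.
fn deinterlace(rgba: &[f32]) -> Vec<f32> {
    let pixels = rgba.len() / 4;
    let mut out = vec![0.0; pixels * 3];
    for (i, px) in rgba.chunks_exact(4).enumerate() {
        out[i] = px[2];              // blue plane
        out[pixels + i] = px[1];     // green plane
        out[2 * pixels + i] = px[0]; // red plane
    }
    out
}

fn bench(c: &mut Criterion) {
    c.bench_function("deinterlace, allocating inside the closure", |b| {
        b.iter(|| {
            // Allocating here means every measured iteration pays the same
            // allocation cost, instead of one variant reusing a warm buffer
            // created outside the benchmark loop.
            let rgba = vec![0.5_f32; 1024 * 1024 * 4];
            black_box(deinterlace(black_box(&rgba)))
        })
    });
}

criterion_group!(benches, bench);
criterion_main!(benches);
```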
It would be great if we could optimize the existing code, awesome :)
See #178 in the comments. It's part of the …
this part is important: it's roughly in this part of the code: exrs/src/image/read/specific_channels.rs, Lines 290 to 302 in 49fece0
Ah, okay cool :) Honestly, I think the conclusion I've drawn from these benchmarks is that this crate is already doing things correctly when it comes to performance :) There are a few differences between the existing code and the function I've written, but they mostly come down to my code being a bit more straightforward. One idea I did test: if the input rgb(a) buffer is mutable, you can deinterlace it in place:

```rust
fn deinterlace_rgba_rows_inplace(rgba: &mut [f32], width: usize, height: usize) {
    // Scratch space for one deinterlaced row, laid out as [B..., G..., R...]
    // to match the file's alphabetical channel order.
    let mut row = vec![0.0; width * 3];
    for y in 0..height {
        {
            // This row's interleaved RGBA input.
            let input = &rgba[y * width * 4..(y + 1) * width * 4];
            let (blue, remaining) = row.split_at_mut(width);
            let (green, red) = remaining.split_at_mut(width);
            for (((chunk, red), green), blue) in
                input.chunks_exact(4).zip(red).zip(green).zip(blue)
            {
                *red = chunk[0];
                *green = chunk[1];
                *blue = chunk[2];
            }
        }
        // Compact in place: row y's 3-channel output always ends before
        // row y + 1's still-unread 4-channel input begins.
        rgba[y * width * 3..(y + 1) * width * 3].copy_from_slice(&row);
    }
}
```

This only allocates a row's worth of pixels, which is quicker. Additionally, the data for each row can be written out with a single memory copy (apart from the line headers). This has good benchmark performance.
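For illustration, here's a hypothetical sanity check of the in-place function (something I'm sketching here, not part of my actual benchmarks):

```rust
// Hypothetical test, for illustration only.
#[test]
fn deinterlaces_one_row_in_place() {
    // A 2x1 image: two interleaved RGBA pixels.
    let mut data = vec![
        1.0, 2.0, 3.0, 4.0, // pixel 0: R, G, B, A
        5.0, 6.0, 7.0, 8.0, // pixel 1: R, G, B, A
    ];
    deinterlace_rgba_rows_inplace(&mut data, 2, 1);
    // The first six floats are now the planar row [B, B, G, G, R, R]
    // in alphabetical channel order; alpha is dropped.
    assert_eq!(&data[..6], &[3.0, 7.0, 2.0, 6.0, 1.0, 5.0]);
}
```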
This requirement of the input being mutable does make things harder, though: if you cloned a …
thanks! appreciate it. we shouldn't stop profiling, though. currently the efforts are focused on doing all the work on multiple threads, including interlacing, which will help :)

unfortunately, deinterlacing in place also requires all channels to have the same sample type. it might be worth optimizing for that, since it is the most common case, but in general the library must support arbitrary channels.

also, the library does not assume any pixel storage in general; it will not assume that the user stores their pixels in a vector in the first place. that's why the API is built with a callback of the form …
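roughly this shape, as a hypothetical sketch (the real exrs signature differs; names and parameters here are illustrative only):

```rust
// Hypothetical illustration of a storage-agnostic callback API; not the
// actual exrs signature. The caller provides the storage and a per-pixel
// setter, so the library never assumes how pixels are laid out in memory.
fn read_pixels<Storage>(
    width: usize,
    height: usize,
    create: impl FnOnce(usize, usize) -> Storage,
    mut set_pixel: impl FnMut(&mut Storage, usize, usize, [f32; 4]),
) -> Storage {
    let mut storage = create(width, height);
    for y in 0..height {
        for x in 0..width {
            // a real decoder would pass the decoded sample here;
            // this sketch just hands over a dummy value.
            set_pixel(&mut storage, x, y, [0.0, 0.0, 0.0, 1.0]);
        }
    }
    storage
}
```

with this shape, the user can back the image with a plain `Vec`, an image-crate buffer, or anything else, and the library never needs to know.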
Note that interlacing and deinterlacing have been optimized a lot in #173, which is not yet part of any stable release. Make sure you're profiling the latest version from git, so that you get the optimized implementations.
Since no action appears to be needed right now, should this be closed or moved to a discussion?
Thanks for taking the time to tinker, benchmark, and let us know about your results, @expenses :) I'll enable discussions as a place to collect detailed insights.
Hi! I just did some benchmarking on what kind of potential performance gains can be achieved when writing uncompressed scanline files. I did this by writing an alternative function that writes the metadata and header as normal, but manually computes and writes the offset table and lines.
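For the uncompressed case, every line block has the same size, so the offset table can be precomputed without buffering any pixel data. A rough sketch of the idea (simplified, not my actual code; assumes f32 samples and the standard 8-byte line header):

```rust
// Sketch: compute the offset table for uncompressed scanlines up front.
// Each entry is the absolute file position of one line's header. Assumes
// f32 samples (4 bytes each) and an 8-byte line header (i32 y coordinate
// plus i32 byte count); simplified compared to real exrs code.
fn compute_offset_table(data_start: u64, width: u64, height: u64, channels: u64) -> Vec<u64> {
    let line_header = 8;
    let line_bytes = line_header + width * channels * 4;
    (0..height).map(|y| data_start + y * line_bytes).collect()
}
```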
You can see the benchmarks here: master...expenses:exrs:benchmarking. Essentially, the big problem is that scanlines in exr files are stored in a deinterlaced fashion in alphabetical channel order, so you have something like `<line header>BBBBGGGGRRRR<line header>BBBBGGGGRRRR`. If you have deinterlaced channels as input, you can write files super fast, as you're just doing 3 big writes per line (see the sketch below). When the input is interlaced, you have to do deinterlacing, which might take up half of the writing time. Here are the benchmark results for me:
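To make the "3 big writes per line" point concrete, here's a sketch of writing one line when the input is already planar (simplified, not my actual implementation; it assumes a little-endian host, f32 samples, and the `bytemuck` crate for the float-to-byte cast):

```rust
use std::io::{self, Write};

// Sketch: one small line header, then one large write per channel plane,
// in alphabetical channel order. Simplified compared to the real format
// handling in exrs.
fn write_scanline<W: Write>(
    out: &mut W,
    y: i32,
    blue: &[f32],
    green: &[f32],
    red: &[f32],
) -> io::Result<()> {
    let data_len = ((blue.len() + green.len() + red.len()) * 4) as i32;
    out.write_all(&y.to_le_bytes())?;        // line header: y coordinate
    out.write_all(&data_len.to_le_bytes())?; // line header: pixel data size
    for plane in [blue, green, red] {
        out.write_all(bytemuck::cast_slice(plane))?; // one big copy per channel
    }
    Ok(())
}
```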