Benchmarking research: deinterlacing rgb(a) buffers is the slow part #192
I think it's amazing that you took the time to go and investigate! thanks. there is a batteries-included storage type that stores the samples for each channel in a separate vector. It uses memcpy to read and write files. Look for …
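conceptually, the layout is something like this rough sketch (hypothetical names, not the actual exrs type):

```rust
// Rough sketch of per-channel ("planar") sample storage; names are
// hypothetical, not the actual exrs type. Because every channel is one
// contiguous vector, whole planes can move to and from the file buffer
// with a single memcpy-style copy each.
struct PlanarRgba {
    width: usize,
    height: usize,
    red: Vec<f32>,
    green: Vec<f32>,
    blue: Vec<f32>,
    alpha: Vec<f32>,
}

impl PlanarRgba {
    fn new(width: usize, height: usize) -> Self {
        let n = width * height;
        Self {
            width,
            height,
            red: vec![0.0; n],
            green: vec![0.0; n],
            blue: vec![0.0; n],
            alpha: vec![1.0; n],
        }
    }
}
```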
is your goal to make …?
also, we are currently working on performing this interleaving on multiple threads.
Ah, great! I didn't know that some work on this had already happened.
At the moment I'm just investigating how fast I can make things. This might become a PR at some point.
Hmm, where does interlacing or deinterlacing happen? The reading path for uncompressed scanlines goes through: Lines 194 to 206 in a3004cc
I did more benchmarking with …
I think something to investigate is why doing a manual deinterlacing step is so much faster than using …
I changed things so that all data is allocated inside the benchmarking closure instead of allowing some of it to be allocated outside, and the results look a lot closer!
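For context, here's roughly what I mean as a minimal criterion sketch — the `deinterlace` function below is a stand-in written for illustration, not the exact code from my branch:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Stand-in for the deinterlacing step under test: split interleaved RGBA
// into planar B, G, R (alphabetical channel order), dropping alpha.
fn deinterlace(rgba: &[f32]) -> Vec<f32> {
    let pixels = rgba.len() / 4;
    let mut out = vec![0.0; pixels * 3];
    for (i, px) in rgba.chunks_exact(4).enumerate() {
        out[i] = px[2];              // blue plane
        out[pixels + i] = px[1];     // green plane
        out[2 * pixels + i] = px[0]; // red plane
    }
    out
}

fn bench(c: &mut Criterion) {
    c.bench_function("deinterlace, allocating inside the closure", |b| {
        b.iter(|| {
            // Allocating here means every measured iteration pays the same
            // allocation cost, instead of one variant reusing a warm buffer
            // created outside the benchmark loop.
            let rgba = vec![0.5_f32; 1024 * 1024 * 4];
            black_box(deinterlace(black_box(&rgba)))
        })
    });
}

criterion_group!(benches, bench);
criterion_main!(benches);
```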
It would be great if we could optimize the existing code, awesome :)
See #178 in the comments. It's part of the …
this part is important: it's roughly in this part of the code: exrs/src/image/read/specific_channels.rs, Lines 290 to 302 in 49fece0
Ah, okay cool :) Honestly, I think the conclusion I've drawn from these benchmarks is that this crate is already doing things correctly when it comes to performance :) There are a few differences between the existing code and the function I've written, but they mostly come down to my code being a bit more straightforward. One idea I did test: if the input rgb(a) buffer is mutable, you can deinterlace it in place:

```rust
fn deinterlace_rgba_rows_inplace(rgba: &mut [f32], width: usize, height: usize) {
    // Scratch space for one deinterlaced row, laid out as [B..., G..., R...]
    // to match the file's alphabetical channel order.
    let mut row = vec![0.0; width * 3];
    for y in 0..height {
        {
            // This row's interleaved RGBA input.
            let input = &rgba[y * width * 4..(y + 1) * width * 4];
            let (blue, remaining) = row.split_at_mut(width);
            let (green, red) = remaining.split_at_mut(width);
            for (((chunk, red), green), blue) in
                input.chunks_exact(4).zip(red).zip(green).zip(blue)
            {
                *red = chunk[0];
                *green = chunk[1];
                *blue = chunk[2];
            }
        }
        // Compact in place: row y's 3-channel output always ends before
        // row y + 1's still-unread 4-channel input begins.
        rgba[y * width * 3..(y + 1) * width * 3].copy_from_slice(&row);
    }
}
```

This only allocates a row's worth of pixels, which is quicker. Additionally, the data for each row can be written out with a single memory copy (apart from the line headers). This has good benchmark performance.
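For illustration, here's a hypothetical sanity check of the in-place function (something I'm sketching here, not part of my actual benchmarks):

```rust
// Hypothetical test, for illustration only.
#[test]
fn deinterlaces_one_row_in_place() {
    // A 2x1 image: two interleaved RGBA pixels.
    let mut data = vec![
        1.0, 2.0, 3.0, 4.0, // pixel 0: R, G, B, A
        5.0, 6.0, 7.0, 8.0, // pixel 1: R, G, B, A
    ];
    deinterlace_rgba_rows_inplace(&mut data, 2, 1);
    // The first six floats are now the planar row [B, B, G, G, R, R]
    // in alphabetical channel order; alpha is dropped.
    assert_eq!(&data[..6], &[3.0, 7.0, 2.0, 6.0, 1.0, 5.0]);
}
```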
This requirement of the input being mutable does make things harder, though: if you cloned a …
thanks! appreciate it. we shouldn't stop profiling, though. currently the efforts are focused on doing all the work on multiple threads, including interlacing, which will help :)

unfortunately, deinterlacing in place also requires all channels to have the same sample type. it might be worth optimizing for that, since it is the most common case, but in general the library must support arbitrary channels.

also, the library does not assume any pixel storage in general; it will not assume that the user stores their pixels in a vector in the first place. that's why the API is built with a callback of the form …
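roughly this shape, as a hypothetical sketch (the real exrs signature differs; names and parameters here are illustrative only):

```rust
// Hypothetical illustration of a storage-agnostic callback API; not the
// actual exrs signature. The caller provides the storage and a per-pixel
// setter, so the library never assumes how pixels are laid out in memory.
fn read_pixels<Storage>(
    width: usize,
    height: usize,
    create: impl FnOnce(usize, usize) -> Storage,
    mut set_pixel: impl FnMut(&mut Storage, usize, usize, [f32; 4]),
) -> Storage {
    let mut storage = create(width, height);
    for y in 0..height {
        for x in 0..width {
            // a real decoder would pass the decoded sample here;
            // this sketch just hands over a dummy value.
            set_pixel(&mut storage, x, y, [0.0, 0.0, 0.0, 1.0]);
        }
    }
    storage
}
```

with this shape, the user can back the image with a plain `Vec`, an image-crate buffer, or anything else, and the library never needs to know.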
Note that interlacing and deinterlacing have been optimized a lot in #173, which is not yet part of any stable release. Make sure you're profiling the latest version from git, so that you get the optimized implementations.
Since no action appears to be needed right now, should this be closed or moved to a discussion?
Thanks for taking the time to tinker, benchmark, and let us know about your results, @expenses :) I'll enable discussions as a place to collect detailed insights.
Hi! I just did some benchmarking on what kind of potential performance gains can be achieved when writing uncompressed scanline files. I did this by writing an alternative function that writes the metadata and header as normal, but manually computes and writes the offset table and lines.
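For the uncompressed case, every line block has the same size, so the offset table can be precomputed without buffering any pixel data. A rough sketch of the idea (simplified, not my actual code; assumes f32 samples and the standard 8-byte line header):

```rust
// Sketch: compute the offset table for uncompressed scanlines up front.
// Each entry is the absolute file position of one line's header. Assumes
// f32 samples (4 bytes each) and an 8-byte line header (i32 y coordinate
// plus i32 byte count); simplified compared to real exrs code.
fn compute_offset_table(data_start: u64, width: u64, height: u64, channels: u64) -> Vec<u64> {
    let line_header = 8;
    let line_bytes = line_header + width * channels * 4;
    (0..height).map(|y| data_start + y * line_bytes).collect()
}
```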
You can see the benchmarks here: master...expenses:exrs:benchmarking. Essentially, the big problem is that scanlines in exr files are stored in a deinterlaced fashion in alphabetical channel order, so you have something like `<line header>BBBBGGGGRRRR<line header>BBBBGGGGRRRR`. If you have deinterlaced channels as input, you can write files super fast, as you're just doing 3 big writes per line (see the sketch below). When the input is interlaced, you have to do deinterlacing, which might take up half of the writing time. Here are the benchmark results for me:
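To make the "3 big writes per line" point concrete, here's a sketch of writing one line when the input is already planar (simplified, not my actual implementation; it assumes a little-endian host, f32 samples, and the `bytemuck` crate for the float-to-byte cast):

```rust
use std::io::{self, Write};

// Sketch: one small line header, then one large write per channel plane,
// in alphabetical channel order. Simplified compared to the real format
// handling in exrs.
fn write_scanline<W: Write>(
    out: &mut W,
    y: i32,
    blue: &[f32],
    green: &[f32],
    red: &[f32],
) -> io::Result<()> {
    let data_len = ((blue.len() + green.len() + red.len()) * 4) as i32;
    out.write_all(&y.to_le_bytes())?;        // line header: y coordinate
    out.write_all(&data_len.to_le_bytes())?; // line header: pixel data size
    for plane in [blue, green, red] {
        out.write_all(bytemuck::cast_slice(plane))?; // one big copy per channel
    }
    Ok(())
}
```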