Pixel format conversions are slow #178
Comments
The code for conversion is pretty complex as it is. That's not good, of course, so I'll explain why it is already rather complex: the user of the library can specify the type of pixels they expect by declaring a tuple type. To achieve this, many intermediate reader objects must be constructed. They are built using recursive types, to avoid having to implement each trait multiple times for all the tuple variations (a rough sketch of the idea is shown below).

It was important to me to provide the most ergonomic API, so in the end, it will always be a pixel-by-pixel callback for the user. For that reason, most of the code in the library also had to be in a pixel-by-pixel structure. However, it might be possible to change some of the internal structure to work more in batches.

That's why it's pretty complex, unfortunately. I hope this clarifies how the recursive types work; I was not able to find a cleaner solution. What I want to say is that this issue will require some very tricky type management. We should definitely make sure that there are no easier alternatives before we invest the time. For example, we should add some documentation that clearly recommends using f32 where possible.
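Roughly, the idea is this (a minimal sketch with illustrative names, not necessarily the exact types used in exrs):

```rust
// Sketch of the recursive-tuple idea: a pixel tuple such as (f32, f32, u32)
// is represented as nested pairs, so every trait only needs two impls
// (one for the empty case, one for the recursive case) instead of one
// impl per tuple arity.
#[allow(dead_code)]
struct NoneMore;

#[allow(dead_code)]
struct Recursive<Inner, Value> {
    inner: Inner,
    value: Value,
}

// Example trait, implemented once per case instead of once per tuple size.
trait ChannelCount {
    fn channel_count() -> usize;
}

impl ChannelCount for NoneMore {
    fn channel_count() -> usize { 0 }
}

impl<Inner: ChannelCount, Value> ChannelCount for Recursive<Inner, Value> {
    fn channel_count() -> usize { Inner::channel_count() + 1 }
}

fn main() {
    // The user-facing tuple (f32, f32, u32) corresponds to this nested type:
    type Pixel = Recursive<Recursive<Recursive<NoneMore, f32>, f32>, u32>;
    assert_eq!(Pixel::channel_count(), 3);
}
```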
We could try to allocate an intermediate buffer just for conversion purposes, if that is more performant. This way we might get faster conversion without having to change any of the recursive type structures (changing them is rather annoying).
That does sound like a good idea. Those buffers could even be allocated on the stack - we don't need them to be big. We can then iterate over the buffer and run the user-supplied callback over it multiple times.
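Something along these lines, as a rough sketch (the sample types, batch size, and function names are placeholders, not the exrs internals):

```rust
// Rough sketch of the batching idea: convert one channel's raw samples into a
// small stack-allocated buffer, then run the user-supplied per-sample callback
// over the converted batch.
fn convert_in_batches(samples: &[u32], mut per_sample: impl FnMut(f32)) {
    const BATCH: usize = 256; // small enough to live on the stack

    for chunk in samples.chunks(BATCH) {
        // stack-allocated scratch buffer for the converted values
        let mut buffer = [0.0_f32; BATCH];
        let converted = &mut buffer[..chunk.len()];

        // a plain slice-to-slice loop like this can be auto-vectorized
        for (dst, &src) in converted.iter_mut().zip(chunk) {
            *dst = src as f32;
        }

        // the user-facing API stays per-pixel: run the callback over the batch
        for &value in converted.iter() {
            per_sample(value);
        }
    }
}

fn main() {
    let samples: Vec<u32> = (0..1000).collect();
    let mut sum = 0.0_f32;
    convert_in_batches(&samples, |value| sum += value);
    println!("{}", sum);
}
```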
This is the data layout during the reading process: in the file, the samples are stored line by line. In each scanline, however, the values are not stored pixel by pixel, but instead channel by channel. This means that for each channel, all of its samples for that line are stored in a contiguous slice. This allows for batch conversion of all the samples of a channel, in theory. I think it should even be possible in the code snippet you included.

The final pixel tuples are stored in a temporary vector for that scanline, which contains placeholder values at first. Then for each channel, we step through the source samples in the slice, and for each sample we overwrite that one component of the corresponding pixel in the temporary vector. This is the snippet you included. Those final tuples in the temporary vector are given to the library's user one by one.

Edit: fixed some wrong information in the explanation above
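In other words, roughly this shape (a simplified sketch assuming three f32 channels; the signatures are made up for illustration, the real code is generic over channel types):

```rust
// Simplified sketch of the per-scanline layout described above: each channel's
// samples are contiguous, and each channel pass fills in one component of the
// temporary pixel tuples before the pixels are handed to the user.
fn read_scanline(
    channels: &[Vec<f32>],               // one contiguous slice of samples per channel
    mut per_pixel: impl FnMut([f32; 3]), // user-supplied pixel callback
) {
    let width = channels[0].len();

    // temporary vector of pixel tuples, filled with placeholder values at first
    let mut pixels = vec![[0.0_f32; 3]; width];

    // for each channel, step through its samples and overwrite that one component
    for (channel_index, samples) in channels.iter().enumerate() {
        for (pixel, &sample) in pixels.iter_mut().zip(samples) {
            pixel[channel_index] = sample;
        }
    }

    // finally, hand the assembled pixels to the user one by one
    for pixel in pixels {
        per_pixel(pixel);
    }
}

fn main() {
    let scanline = vec![vec![0.1; 4], vec![0.2; 4], vec![0.3; 4]]; // r, g, b
    read_scanline(&scanline, |pixel| println!("{:?}", pixel));
}
```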
that sounds a lot like that simd library that offers an iterator-based api. don't know if they do conversion though
yes! great idea :)
In exrs/src/image/read/specific_channels.rs, lines 296 to 298 (at 49fece0):

I suggest we add a function. Also, in the snippet, `get_pixel` should actually be called `get_sample`.
Sounds good to me! Except the vector size should be a power of two.
yes yes. then we can just use a chunks iterator and that's it, pretty simple after all. all the worries I had were not necessary haha
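For example, something like this could work (just a sketch; `BATCH`, the sample type, and the function names are placeholders), using `chunks_exact` so the hot loop always sees a full power-of-two batch:

```rust
// Sketch: iterate a channel's samples in power-of-two batches using
// chunks_exact, then handle the short tail separately.
fn for_each_batch(samples: &[u32], mut handle_batch: impl FnMut(&[u32])) {
    const BATCH: usize = 256; // power of two, as discussed above

    let mut chunks = samples.chunks_exact(BATCH);
    for chunk in &mut chunks {
        handle_batch(chunk); // full batches: easy for the compiler to vectorize
    }

    let remainder = chunks.remainder();
    if !remainder.is_empty() {
        handle_batch(remainder); // the last, possibly shorter batch
    }
}

fn main() {
    let samples: Vec<u32> = (0..1000).collect();
    let mut total_len = 0;
    for_each_batch(&samples, |batch| total_len += batch.len());
    assert_eq!(total_len, samples.len());
}
```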
Great! I'd prefer if you implemented it, since you're familiar with the code, and the difficulty is more in the API than in the algorithm. I'm happy to advise on the vectorization though!
I'll prepare a prototype :) |
Once all the outstanding PRs are merged and this issue is squared away, I'd expect exrs to be roughly on par with the canonical OpenEXR implementation in terms of performance in multi-threaded mode (provided that intrinsics are used for f16 conversion). We might even beat the canonical OpenEXR on ARM, because the C implementation only includes SSE-optimized functions and the scalar fallbacks are really naive, while exrs leans on the autovectorizer, which can target any platform.
that would be great! I hope I'll have time on the weekend
Right now format conversion invokes a callback on every pixel access, and is generally oriented towards per-pixel processing:
exrs/src/image/read/specific_channels.rs, lines 279 to 307 (at 49fece0)
Note the `get_pixel: impl Fn(&mut FullPixel) -> &mut Sample` in the signature. This prevents the compiler from vectorizing the code. There are vector instructions to turn batches of values of one type into another, but this interface prevents their use.
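Roughly, the hot path has this shape (a simplified illustration, not the actual exrs code; the types and names are placeholders):

```rust
// Simplified illustration: every sample is converted and written through an
// opaque per-pixel lookup, one value at a time, so the compiler cannot turn
// the conversion into batched vector instructions.
fn write_samples_per_pixel<FullPixel>(
    source: &[u32],
    pixels: &mut [FullPixel],
    get_sample: impl Fn(&mut FullPixel) -> &mut f32, // one opaque lookup per sample
) {
    for (pixel, &sample) in pixels.iter_mut().zip(source) {
        *get_sample(pixel) = sample as f32; // scalar conversion, one sample at a time
    }
}

// Example accessor: write into the first component of a (f32, f32) pixel.
fn first_component(pixel: &mut (f32, f32)) -> &mut f32 {
    &mut pixel.0
}

fn main() {
    let source: Vec<u32> = (0..8).collect();
    let mut pixels = vec![(0.0_f32, 0.0_f32); 8];
    write_samples_per_pixel(&source, &mut pixels, first_component);
    println!("{:?}", pixels);
}
```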
Lack of vectorization is especially bad for `f16` types, which do not exist natively on the CPU, and there are only vector instructions to convert to/from `f16`. See #177 for details - it's closed in favor of this one, but is still a problem.

For `f16` conversions not to be atrociously slow, they need to be batched together explicitly. The `half` crate provides functions converting slices to/from `f16`, see e.g. `convert_from_f32_slice`.
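For example, a whole channel's worth of samples can be converted in one call. A sketch using the `half` crate's slice extension trait (the buffer sizes and layout here are made up for illustration):

```rust
use half::f16;
use half::slice::HalfFloatSliceExt;

fn main() {
    // One channel's worth of f32 samples for a scanline.
    let samples_f32: Vec<f32> = (0..2048).map(|i| i as f32 / 2048.0).collect();

    // Convert the whole slice to f16 in one batched call. Depending on crate
    // features and target support, this maps to hardware conversion instructions.
    let mut samples_f16 = vec![f16::ZERO; samples_f32.len()];
    samples_f16.convert_from_f32_slice(&samples_f32);

    // And back again, also as a single batched slice conversion.
    let mut roundtrip = vec![0.0_f32; samples_f16.len()];
    samples_f16.convert_to_f32_slice(&mut roundtrip);

    println!("{:?} -> {:?} -> {:?}", &samples_f32[..3], &samples_f16[..3], &roundtrip[..3]);
}
```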
For other types, straightforward conversion loops like the ones sketched below will be automatically vectorized, provided that `input` and `output` are both slices. You can also convert into a `Vec`.
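A minimal sketch of the kind of loop meant here (the names `input` and `output` are just placeholders), converting `u32` samples to `f32`:

```rust
// Slice-to-slice conversion: with both sides being plain slices, the compiler
// can auto-vectorize this loop.
fn convert_slice(input: &[u32], output: &mut [f32]) {
    for (dst, &src) in output.iter_mut().zip(input) {
        *dst = src as f32;
    }
}

// Alternatively, convert into a freshly allocated Vec.
fn convert_to_vec(input: &[u32]) -> Vec<f32> {
    input.iter().map(|&sample| sample as f32).collect()
}

fn main() {
    let input: Vec<u32> = (0..16).collect();

    let mut output = vec![0.0_f32; input.len()];
    convert_slice(&input, &mut output);

    assert_eq!(output, convert_to_vec(&input));
}
```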