Discussion of possible algorithmic approaches with the good stuff near the end
It can even be fully parallelized. The block shuffling pass would then require either a rav1d-style DisjointMut (which is unsafe, but maybe not in a really bad way) or something like Vec<Mutex<&mut [f32]>>. The per-block operation, by contrast, should be trivial to express with SIMD and a small out-of-place buffer.
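To make the Vec<Mutex<&mut [f32]>> option concrete, here is a minimal sketch using only safe std. The function name shuffle_blocks and its parameters (block_len, swaps, threads) are assumptions made for illustration; in particular, the real block permutation is not specified above, so it is stood in for by a plain list of block-index pairs to swap.

```rust
use std::sync::Mutex;
use std::thread;

/// Swap whole blocks of `data` in parallel.
/// `swaps` is a hypothetical list of (i, j) block-index pairs standing in for
/// whatever permutation the real shuffling pass applies.
fn shuffle_blocks(data: &mut [f32], block_len: usize, swaps: &[(usize, usize)], threads: usize) {
    // Assumption: the buffer splits into equal-sized blocks, so any two
    // blocks can be exchanged with `swap_with_slice`.
    assert!(block_len > 0 && data.len() % block_len == 0);

    // Each block becomes an independently lockable mutable slice.
    let blocks: Vec<Mutex<&mut [f32]>> = data.chunks_mut(block_len).map(Mutex::new).collect();
    let blocks = &blocks;

    // Split the swap list into roughly equal chunks, one per worker.
    let threads = threads.max(1);
    let per_thread = ((swaps.len() + threads - 1) / threads).max(1);

    thread::scope(|s| {
        for work in swaps.chunks(per_thread) {
            s.spawn(move || {
                for &(i, j) in work {
                    if i == j {
                        continue; // locking the same mutex twice would deadlock
                    }
                    // Always lock the lower index first so that two workers
                    // can never wait on each other in a cycle.
                    let (lo, hi) = (i.min(j), i.max(j));
                    let mut a = blocks[lo].lock().unwrap();
                    let mut b = blocks[hi].lock().unwrap();
                    a.swap_with_slice(&mut **b);
                }
            });
        }
    });
}
```

Calling shuffle_blocks(&mut data, 2, &[(0, 3), (1, 2)], 2) on an 8-element buffer reverses the order of its four 2-element blocks. The mutexes keep this memory-safe even if the pairs overlap, but the result is only deterministic when every block appears in at most one pair. In that case the locks are never contended, which is the overhead a DisjointMut-style approach would avoid at the cost of unsafe code.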