Better small size kernels #75
Description
RustFFT can do a size-32 FFT entirely in registers on AVX2 via a dedicated compute kernel. Its size-64 kernel does not fit entirely in registers but is still highly optimized.
Meanwhile, we use the generic structure with lots of successive radix-2 passes over the input, so we have to do lots of loads and stores between the actual math.
I don't want to have a dedicated handwritten kernel for each small size. What I'd like to do is replace the current dedicated code for small-size passes with one or two kernels (maybe size 32 or 64 or both) that larger FFTs can be reduced to, with generic size-independent passes on top. This means we can't simply crib the RustFFT kernels, because they do mixed-radix radix-4-2 with integrated bit reversal. We want radix-2 (likely with the optimizations from #74) and a general bit reversal, so that this kernel can be reused as a component when processing larger sizes.
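
To make the proposed structure concrete, here is a minimal scalar sketch, not actual code from this repo or from RustFFT. All names (`dif_pass`, `base_kernel`, `bit_reverse`, `fft`, `BASE`) are hypothetical. It assumes a decimation-in-frequency radix-2 layout, computes twiddles on the fly for brevity, and uses no SIMD and none of the #74 optimizations. The point is only the shape: generic passes shrink the problem to size-32 chunks, one reusable kernel finishes each chunk on a local copy, and a separate general bit reversal runs once at the end.

```rust
use std::f64::consts::PI;

/// Complex number as (re, im); a stand-in for whatever layout the real code uses.
type C = (f64, f64);

/// Hypothetical base kernel size (the issue suggests 32 or 64).
const BASE: usize = 32;

fn mul(a: C, b: C) -> C {
    (a.0 * b.0 - a.1 * b.1, a.0 * b.1 + a.1 * b.0)
}

/// exp(-2*pi*i*j/m), computed on the fly to keep the sketch short.
fn twiddle(j: usize, m: usize) -> C {
    let ang = -2.0 * PI * (j as f64) / (m as f64);
    (ang.cos(), ang.sin())
}

/// One generic, size-independent DIF radix-2 pass with block size `m`.
fn dif_pass(x: &mut [C], m: usize) {
    let half = m / 2;
    for block in x.chunks_exact_mut(m) {
        for j in 0..half {
            let (a, b) = (block[j], block[j + half]);
            block[j] = (a.0 + b.0, a.1 + b.1);
            block[j + half] = mul((a.0 - b.0, a.1 - b.1), twiddle(j, m));
        }
    }
}

/// Size-BASE kernel: load the chunk once, run all remaining passes on a
/// local copy, store once. It leaves its output in bit-reversed order.
fn base_kernel(chunk: &mut [C]) {
    let mut t: [C; BASE] = [(0.0, 0.0); BASE];
    t.copy_from_slice(chunk);
    let mut m = BASE;
    while m >= 2 {
        dif_pass(&mut t, m);
        m /= 2;
    }
    chunk.copy_from_slice(&t);
}

/// General bit reversal over the whole buffer, kept separate from the kernel.
fn bit_reverse(x: &mut [C]) {
    let n = x.len();
    let bits = n.trailing_zeros();
    for i in 0..n {
        let j = i.reverse_bits() >> (usize::BITS - bits);
        if i < j {
            x.swap(i, j);
        }
    }
}

/// Forward FFT for power-of-two lengths >= BASE.
fn fft(x: &mut [C]) {
    let n = x.len();
    assert!(n.is_power_of_two() && n >= BASE);
    // Generic passes reduce the problem to independent size-BASE chunks.
    let mut m = n;
    while m > BASE {
        dif_pass(x, m);
        m /= 2;
    }
    // The same small kernel finishes every chunk.
    for chunk in x.chunks_exact_mut(BASE) {
        base_kernel(chunk);
    }
    bit_reverse(x);
}
```

Calling `fft(&mut buf)` on a 128-element buffer runs two generic passes (block sizes 128 and 64), then `base_kernel` on four chunks, then one bit reversal. Whether the local array in `base_kernel` actually stays in registers is up to the compiler and the target; a real kernel would presumably be written with explicit SIMD.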