take fast path if c2c transform does not need padding or trimming #283

Open
chillenb wants to merge 6 commits into IntelPython:master from chillenb:faster

@chillenb chillenb commented Mar 3, 2026

Thanks for creating and maintaining this package!

If you try to get MKL C-API performance out of this package, you will probably discover that fftn is very sensitive to the input arguments. Here is an example:

In [1]: import numpy as np
   ...: import mkl_fft
   ...: N = 200
   ...: A = np.random.random((1,N,N,N)).astype(np.complex128)
 
In [2]: %timeit mkl_fft.fftn(A, s=A.shape[1:], axes=(1,2,3))
164 ms ± 187 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit mkl_fft.fftn(A, axes=(1,2,3))
6.56 ms ± 304 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit mkl_fft.interfaces.numpy_fft.fftn(A, axes=(1,2,3))
165 ms ± 31.3 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is because mkl_fft.fftn always takes a slow path (_iter_fftnd) when s is not None. Furthermore, the NumPy and SciPy interfaces do not pass s=None through unchanged, so they are forced onto this path as well.
This pull request allows fftn to detect when the input s argument is equivalent to s=None so it can use the faster function _iter_complementary.
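
As a rough illustration of that check (a minimal sketch, not the code in this PR; the helper name `s_matches_shape` is hypothetical), `s` is equivalent to `s=None` when every requested length equals the input's length along the corresponding axis:

```python
def s_matches_shape(shape, s=None, axes=None):
    """Return True if `s` asks for no padding or trimming along `axes`.

    Hypothetical helper for illustration only; it is not part of mkl_fft.
    """
    if s is None:
        # No explicit sizes requested: nothing to pad or trim.
        return True
    ndim = len(shape)
    if axes is None:
        # NumPy convention: when axes is omitted, s applies to the
        # last len(s) axes of the input.
        axes = range(ndim - len(s), ndim)
    axes = list(axes)
    if len(axes) != len(s):
        raise ValueError("s and axes must have the same length")
    # Equivalent to s=None only if every requested length already
    # matches the input length along that axis.
    return all(shape[ax] == n for ax, n in zip(axes, s))
```

For the array in the example above, `s_matches_shape(A.shape, s=A.shape[1:], axes=(1, 2, 3))` is true, so no padding or trimming is needed.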

After these code changes, performance aligns better with expectations:

In [1]: import numpy as np
   ...: import mkl_fft
   ...: N = 200
   ...: A = np.random.random((1,N,N,N)).astype(np.complex128)

In [2]: %timeit mkl_fft.interfaces.numpy_fft.fftn(A, axes=(1,2,3))
8.28 ms ± 551 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit mkl_fft.fftn(A, s=A.shape[1:], axes=(1,2,3))
The slowest run took 4.49 times longer than the fastest. This could mean that an intermediate result is being cached.
9.92 ms ± 7.49 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit mkl_fft.fftn(A, axes=(1,2,3))
6.49 ms ± 60.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Test system: dual-socket Xeon Platinum 8268 server.

@intel-python-devops

Can one of the admins verify this patch?

@chillenb
Author

chillenb commented Mar 3, 2026

Oops, I didn't realize that the CI would have to be approved again after fixing the whitespace. Sorry!

The tests previously approved did run and pass, though.

@ndgrigorian
Collaborator

> Oops, I didn't realize that the CI would have to be approved again after fixing the whitespace. Sorry!
>
> The tests previously approved did run and pass, though.

It's no problem, thanks for this contribution to the project. :)

I'm not sure if any of our tests currently cover this case and compare with e.g. numpy, so adding a test would be good too.

@antonwolfy antonwolfy added this to the 2.2.0 release milestone Mar 6, 2026
Collaborator

@antonwolfy antonwolfy left a comment

Could you please also update the changelog?

chillenb added a commit to chillenb/mkl_fft that referenced this pull request Mar 6, 2026
chillenb added a commit to chillenb/mkl_fft that referenced this pull request Mar 6, 2026
@chillenb
Author

chillenb commented Mar 6, 2026

I just rebased off the master branch to get rid of the merge conflict, so the conversation looks a bit garbled--apologies.
