@fredrik-johansson
Collaborator

When the unreduced coefficients of an _nmod_poly product are small enough to fit in 50 bits, fft_small multiplies using a single FFT prime.

We can do even better when the moduli are really small by packing a linear polynomial into each coefficient. If the coefficients of the product (over $\mathbb{Z}$) are smaller than $M$, we can repack $a_0 + a_1 x + \ldots$ as $(a_0 + a_1 M) + (a_2 + a_3 M) x + \ldots$. The coefficients of the product of such polynomials will be quadratic polynomials in $M$, from which we can read off the coefficients of the original product.
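A minimal C sketch of the repacking, assuming $M = 2^{16}$ and using a schoolbook product as a stand-in for the single-prime FFT in fft_small. The names `pack_pairs`, `packed_mul` and `MAX_LEN` are hypothetical and not part of FLINT. Writing a packed product coefficient as $C_k = d_0 + d_1 M + d_2 M^2$, the middle digit is $c_{2k+1}$, while $d_0$ of $C_k$ and $d_2$ of $C_{k-1}$ add up to $c_{2k}$.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PACK_BITS 16                          /* M = 2^16 (assumption of this sketch) */
#define PACK_M    ((uint64_t) 1 << PACK_BITS)
#define MAX_LEN   64                          /* arbitrary cap for this sketch */

/* Pack a[0..n-1] two coefficients per word: out[i] = a[2i] + a[2i+1] * M.
   Returns the packed length ceil(n / 2). */
static size_t pack_pairs(uint64_t *out, const uint32_t *a, size_t n)
{
    size_t i, len = (n + 1) / 2;

    for (i = 0; i < len; i++)
    {
        uint64_t lo = a[2 * i];
        uint64_t hi = (2 * i + 1 < n) ? a[2 * i + 1] : 0;
        out[i] = lo + hi * PACK_M;
    }

    return len;
}

/* Multiply a (length an) by b (length bn) over Z via the packed product.
   Requires an, bn >= 1 and every coefficient of the true product < M.
   c receives an + bn - 1 coefficients. */
void packed_mul(uint32_t *c, const uint32_t *a, size_t an,
                const uint32_t *b, size_t bn)
{
    uint64_t A[MAX_LEN], B[MAX_LEN], C[2 * MAX_LEN];
    size_t i, j, k, pa, pb, pc, clen = an + bn - 1;

    assert(an >= 1 && bn >= 1 && an <= MAX_LEN && bn <= MAX_LEN);

    pa = pack_pairs(A, a, an);
    pb = pack_pairs(B, b, bn);
    pc = pa + pb - 1;

    /* Schoolbook convolution standing in for the FFT.
       Under the precondition, each C[k] is below M^3 = 2^48. */
    for (k = 0; k < pc; k++)
        C[k] = 0;
    for (i = 0; i < pa; i++)
        for (j = 0; j < pb; j++)
            C[i + j] += A[i] * B[j];

    for (i = 0; i < clen; i++)
        c[i] = 0;

    /* Read off the base-M digits of each packed coefficient. */
    for (k = 0; k < pc; k++)
    {
        uint32_t d0 = (uint32_t) (C[k] & (PACK_M - 1));
        uint32_t d1 = (uint32_t) ((C[k] >> PACK_BITS) & (PACK_M - 1));
        uint32_t d2 = (uint32_t) (C[k] >> (2 * PACK_BITS));

        c[2 * k] += d0;                            /* even-even part of c[2k] */
        if (2 * k + 1 < clen) c[2 * k + 1] = d1;   /* all of c[2k+1] */
        if (2 * k + 2 < clen) c[2 * k + 2] += d2;  /* odd-odd part of c[2k+2] */
    }
}
```

For instance, with $a = 1 + 2x + 3x^2$ and $b = 4 + 5x$ the packed inputs have lengths 2 and 1 instead of 3 and 2, and unpacking recovers $4 + 13x + 22x^2 + 15x^3$. Reduction modulo $m$ would follow as a separate step.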

It is clear that we can take $M = 2^{16}$ since $M^3$ is smaller than the 50-bit FFT primes used by fft_small. With a more careful analysis we can show that it is usually possible to work with $M = 2^{17}$.
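The easy case $M = 2^{16}$ can be checked with a crude bound: if both inputs are reduced to $[0, m)$, every coefficient of the product over $\mathbb{Z}$ is at most $\min(\mathrm{len}_1, \mathrm{len}_2) \cdot (m-1)^2$. The sketch below tests that bound against $2^{16}$. The function name `packing_fits_16` is hypothetical, and this is only a sufficient condition; it is not the more careful $M = 2^{17}$ analysis used in the PR, so it accepts shorter inputs than the PR does.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if every coefficient of the product over Z is provably below
   M = 2^16, using the crude bound min(an, bn) * (m - 1)^2, and 0 otherwise.
   Inputs are assumed to be reduced to [0, m). */
int packing_fits_16(uint64_t an, uint64_t bn, uint64_t m)
{
    const uint64_t M = (uint64_t) 1 << 16;
    uint64_t len = (an < bn) ? an : bn;
    uint64_t sq;

    assert(m >= 1);

    if (m == 1 || len == 0)
        return 1;               /* the product is identically zero */
    if (m - 1 >= 256)
        return 0;               /* (m - 1)^2 >= M already */

    sq = (m - 1) * (m - 1);
    /* len * sq < M  <=>  len <= (M - 1) / sq, written to avoid overflow */
    return len <= (M - 1) / sq;
}
```

When the check passes, each packed product coefficient is below $M^3 = 2^{48}$, which is smaller than a 50-bit prime, so the single-prime FFT result is exact.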

Where this trick is applicable, currently for moduli $1 \le m \le 23$, it gives up to a 2x speedup by halving the convolution length. Plots demonstrating the speedup for _nmod_poly_mul and _nmod_poly_mullow are attached below.

Input lengths where this trick is applicable are roughly up to 250000 for $m = 2$, 77000 for $m = 3$, 21000 for $m = 5$, 9800 for $m = 7$, 3600 for $m = 11$, 1400 for $m = 17$ and 700 for $m = 23$.

This is definitely a hack: an optimized 32-bit FFT/NTT (or maybe something Schönhage-Strassen-like with many coefficients bit-packed in each word) should perform even better, and would allow much larger moduli and longer products. But this hack is easy to implement with the tools we have in FLINT right now, so we may as well use it.

Example improvement on a nontrivial benchmark problem: constructing GF($5^{3125}$) previously took 5.25 seconds and now takes 3.97 seconds with this PR (a 1.32x speedup).

BTW, a new (to FLINT) trick used in the unpacking code is the 32-bit precomped remainder algorithm by Lemire, Kaser & Kurz which could be useful elsewhere in the nmod modules. This is even faster than the code generated by GCC for remainder by a compile-time constant.

[Plots: mul_balanced, mullow_balanced, mul_unbalanced, mullow_unbalanced, sqr, sqrlow]

Comment on lines +426 to +444
static const short fft_mul_tab[] = {1326, 1326, 1095, 802, 674, 537, 330, 306, 290,
274, 200, 192, 182, 173, 163, 99, 97, 93, 90, 82, 80, 438, 414, 324, 393,
298, 298, 268, 187, 185, 176, 176, 168, 167, 158, 158, 97, 96, 93, 92, 89,
89, 85, 85, 80, 81, 177, 172, 163, 162, 164, 176, 171, 167, 167, 164, 163,
163, 160, 165, 95, 96, 90, 94, };

static const short fft_sqr_tab[] = {1420, 1420, 1353, 964, 689, 569, 407, 353, 321,
321, 292, 279, 200, 182, 182, 159, 159, 152, 145, 139, 723, 626, 626, 569,
597, 448, 542, 292, 292, 200, 191, 191, 182, 182, 166, 166, 166, 159, 159,
159, 152, 152, 145, 145, 93, 200, 191, 182, 182, 182, 182, 191, 191, 191,
182, 182, 174, 182, 182, 182, 152, 152, 152, 145, };

/* todo: separate squaring table */
/* todo: check unbalanced cutoffs */
static const short fft_mullow_tab[] = {1115, 1115, 597, 569, 407, 321, 306, 279, 191,
182, 166, 159, 152, 145, 139, 89, 85, 78, 75, 75, 69, 174, 174, 166, 159,
152, 152, 152, 97, 101, 106, 111, 101, 101, 101, 139, 145, 145, 139, 145,
145, 139, 145, 145, 145, 182, 182, 182, 182, 182, 182, 191, 200, 220, 210,
200, 210, 210, 210, 210, 191, 182, 182, 174, };
Collaborator

Are these tabs architecture specific?

Collaborator Author

Yes, and all the other tuning parameters in this file too. Note that these particular tabs were around before; I just moved them to a new file.

Collaborator

Yes, I did notice that. Do we have a tuning program somewhere that we can use at a later point?

Collaborator Author

src/gr_poly/tune/cutoffs.c can generate this kind of table but it's not automatic.

@vneiger
Collaborator

vneiger commented Nov 6, 2025

Nice!

> BTW, a new (to FLINT) trick used in the unpacking code is the 32-bit precomped remainder algorithm by Lemire, Kaser & Kurz which could be useful elsewhere in the nmod modules. This is even faster than the code generated by GCC for remainder by a compile-time constant.

Is this different from using "Shoup precomputation" (as in #2061) but specialized to the 32-bit context?

@fredrik-johansson
Collaborator Author

> Nice!
>
> > BTW, a new (to FLINT) trick used in the unpacking code is the 32-bit precomped remainder algorithm by Lemire, Kaser & Kurz which could be useful elsewhere in the nmod modules. This is even faster than the code generated by GCC for remainder by a compile-time constant.
>
> Is this different from using "Shoup precomputation" (as in #2061) but specialized to the 32-bit context?

Unless I missed something, the Shoup reduction still requires one conditional adjustment, but the Lemire et al. method doesn't: the high part of a product gives the exact remainder right away.
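To make the difference concrete, here is a hedged C sketch of both 32-bit remainders. The function names are hypothetical, this is not the code in the PR, and the Shoup-style variant is a plain remainder written for comparison rather than a copy of what #2061 does. The Lemire, Kaser & Kurz version precomputes $c = \lceil 2^{64}/d \rceil$ and returns the high 64 bits of $((c \cdot a) \bmod 2^{64}) \cdot d$; the Shoup-style version precomputes $w = \lfloor 2^{32}/d \rfloor$ and needs one conditional subtraction.

```c
#include <assert.h>
#include <stdint.h>

/* Lemire-Kaser-Kurz precomputation: c = ceil(2^64 / d) for d >= 1.
   The value wraps to 0 when d = 1, which still gives the right remainder. */
uint64_t lkk_precompute(uint32_t d)
{
    assert(d >= 1);
    return UINT64_MAX / d + 1;
}

/* a mod d with no correction step. */
uint32_t lkk_mod(uint32_t a, uint64_t c, uint32_t d)
{
    uint64_t low = c * a;              /* wraps modulo 2^64 on purpose */
    uint64_t lo = low & 0xFFFFFFFFu;
    uint64_t hi = low >> 32;

    /* High 64 bits of the 96-bit product low * d, without a 128-bit type. */
    return (uint32_t) ((((lo * d) >> 32) + hi * d) >> 32);
}

/* Shoup-style precomputation: w = floor(2^32 / d) for d >= 1. */
uint64_t shoup_precompute(uint32_t d)
{
    assert(d >= 1);
    return ((uint64_t) 1 << 32) / d;
}

/* a mod d via an approximate quotient; the candidate remainder lies in
   [0, 2d), so one conditional subtraction is needed. */
uint32_t shoup_mod(uint32_t a, uint64_t w, uint32_t d)
{
    uint64_t q = (w * a) >> 32;        /* q <= floor(a / d) */
    uint64_t r = a - q * d;

    if (r >= d)
        r -= d;

    return (uint32_t) r;
}
```

Both variants cost one precomputed word per modulus; the first trades the branch or conditional move for a second multiplication.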

Co-authored-by: Albin Ahlbäck <albin.ahlback@gmail.com>