Repack to half-length fft_small convolution for tiny moduli in _nmod_poly_mul and mullow #2478
base: main
…poly_mul and _nmod_poly_mullow
static const short fft_mul_tab[] = {1326, 1326, 1095, 802, 674, 537, 330, 306, 290,
    274, 200, 192, 182, 173, 163, 99, 97, 93, 90, 82, 80, 438, 414, 324, 393,
    298, 298, 268, 187, 185, 176, 176, 168, 167, 158, 158, 97, 96, 93, 92, 89,
    89, 85, 85, 80, 81, 177, 172, 163, 162, 164, 176, 171, 167, 167, 164, 163,
    163, 160, 165, 95, 96, 90, 94, };

static const short fft_sqr_tab[] = {1420, 1420, 1353, 964, 689, 569, 407, 353, 321,
    321, 292, 279, 200, 182, 182, 159, 159, 152, 145, 139, 723, 626, 626, 569,
    597, 448, 542, 292, 292, 200, 191, 191, 182, 182, 166, 166, 166, 159, 159,
    159, 152, 152, 145, 145, 93, 200, 191, 182, 182, 182, 182, 191, 191, 191,
    182, 182, 174, 182, 182, 182, 152, 152, 152, 145, };

/* todo: separate squaring table */
/* todo: check unbalanced cutoffs */
static const short fft_mullow_tab[] = {1115, 1115, 597, 569, 407, 321, 306, 279, 191,
    182, 166, 159, 152, 145, 139, 89, 85, 78, 75, 75, 69, 174, 174, 166, 159,
    152, 152, 152, 97, 101, 106, 111, 101, 101, 101, 139, 145, 145, 139, 145,
    145, 139, 145, 145, 145, 182, 182, 182, 182, 182, 182, 191, 200, 220, 210,
    200, 210, 210, 210, 210, 191, 182, 182, 174, };
Are these tabs architecture specific?
Yes, and all the other tuning parameters in this file too. Note that these particular tabs were around before; I just moved them to a new file.
Yes, I did notice that. Do we have a tuning program somewhere that we can use at a later point?
src/gr_poly/tune/cutoffs.c can generate this kind of table but it's not automatic.
Nice!
Is this different from using "Shoup precomputation" (as in #2061) but specialized to the 32-bit context?
Unless I missed something, the Shoup reduction still requires one conditional adjustment, but the Lemire et al. method doesn't: the high part of a product gives the exact remainder right away.
Co-authored-by: Albin Ahlbäck <albin.ahlback@gmail.com>
When the unreduced coefficients of an `_nmod_poly` product are small enough to fit in 50 bits, `fft_small` multiplies using a single FFT prime.

We can do even better when the moduli are really small by packing a linear polynomial into each coefficient. If the coefficients of the product (over $\mathbb{Z}$) are smaller than $M$, we can repack $a_0 + a_1 x + \ldots$ as $(a_0 + a_1 M) + (a_2 + a_3 M) x + \ldots$. The coefficients of the product of such polynomials will be quadratic polynomials in $M$, from which we can read off the coefficients of the original product.
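The repacking can be sketched in plain C as follows. This is only an illustration of the idea, not the code in this PR: the names `pack_pairs` and `mul_repacked` are hypothetical, $M = 2^{16}$ is assumed, and a schoolbook loop stands in for the half-length `fft_small` convolution. Writing a packed product coefficient as $d_0 + d_1 M + d_2 M^2$, the digit $d_0$ contributes to $x^{2k}$, $d_1$ to $x^{2k+1}$ and $d_2$ to $x^{2k+2}$ of the original product.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PACK_BITS 16                          /* assumed M = 2^16 */
#define PACK_M ((uint64_t) 1 << PACK_BITS)

/* Pack (a0 + a1 M), (a2 + a3 M), ...; returns the packed length. */
static size_t pack_pairs(uint64_t *out, const uint32_t *a, size_t len)
{
    size_t n = (len + 1) / 2;
    for (size_t i = 0; i < n; i++)
    {
        uint64_t hi = (2 * i + 1 < len) ? a[2 * i + 1] : 0;
        out[i] = a[2 * i] + hi * PACK_M;
    }
    return n;
}

/* Hypothetical helper: res = a * b mod m, coefficients of a, b in [0, m).
   res must have room for lena + lenb - 1 entries. */
static void mul_repacked(uint32_t *res, const uint32_t *a, size_t lena,
                         const uint32_t *b, size_t lenb, uint32_t m)
{
    size_t lenr = lena + lenb - 1;
    size_t minlen = (lena < lenb) ? lena : lenb;

    assert(lena > 0 && lenb > 0 && m > 0 && m <= PACK_M);
    /* Every coefficient of the product over Z must be smaller than M. */
    assert((uint64_t) minlen * (m - 1) * (m - 1) < PACK_M);

    uint64_t *A = malloc(((lena + 1) / 2) * sizeof(uint64_t));
    uint64_t *B = malloc(((lenb + 1) / 2) * sizeof(uint64_t));
    assert(A != NULL && B != NULL);

    size_t na = pack_pairs(A, a, lena);
    size_t nb = pack_pairs(B, b, lenb);
    size_t nc = na + nb - 1;
    uint64_t *C = calloc(nc, sizeof(uint64_t));
    assert(C != NULL);

    /* Schoolbook stand-in for the half-length convolution. */
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            C[i + j] += A[i] * B[j];

    for (size_t k = 0; k < lenr; k++)
        res[k] = 0;

    for (size_t k = 0; k < nc; k++)
    {
        /* C[k] = d0 + d1 M + d2 M^2 with all digits smaller than M. */
        uint32_t d0 = (uint32_t) (C[k] & (PACK_M - 1));
        uint32_t d1 = (uint32_t) ((C[k] >> PACK_BITS) & (PACK_M - 1));
        uint32_t d2 = (uint32_t) (C[k] >> (2 * PACK_BITS));

        if (2 * k < lenr)     res[2 * k] += d0;
        if (2 * k + 1 < lenr) res[2 * k + 1] += d1;
        if (2 * k + 2 < lenr) res[2 * k + 2] += d2;
    }

    for (size_t k = 0; k < lenr; k++)
        res[k] %= m;

    free(A);
    free(B);
    free(C);
}
```

The sketch uses the simple sufficient bound $\min(\text{len}_a, \text{len}_b) \cdot (m-1)^2 < M$ to guarantee that the digits do not overlap; it does not attempt any sharper analysis.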
It is clear that we can take $M = 2^{16}$ since $M^3$ is smaller than the 50-bit FFT primes used by `fft_small`. With a more careful analysis we can show that it is usually possible to work with $M = 2^{17}$.

Where this trick is applicable, currently for moduli $1 \le m \le 23$, it gives up to a 2x speedup by halving the convolution length. Plots demonstrating the speedup for `_nmod_poly_mul` and `_nmod_poly_mullow` are attached below.

Input lengths where this trick is applicable are roughly up to 250000 for $m = 2$, 77000 for $m = 3$, 21000 for $m = 5$, 9800 for $m = 7$, 3600 for $m = 11$, 1400 for $m = 17$ and 700 for $m = 23$.
This is definitely a hack: an optimized 32-bit FFT/NTT (or maybe something Schonhage-Strassen-like with many coefficients bit-packed in each word) should perform even better, and would allow much larger moduli and longer products. But this hack is easy to implement with the tools we have in FLINT right now, so we may as well use it.
Example improvement on a nontrivial benchmark problem: constructing GF($5^{3125}$) previously took 5.25 seconds and takes 3.97 seconds with this PR (1.32x speedup).
BTW, a new (to FLINT) trick used in the unpacking code is the 32-bit precomped remainder algorithm by Lemire, Kaser & Kurz, which could be useful elsewhere in the `nmod` modules. This is even faster than the code generated by GCC for remainder by a compile-time constant.
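For reference, a minimal sketch of the 32-bit remainder method of Lemire, Kaser & Kurz is given below. The function names `rem32_precomp` and `rem32` are made up for illustration and are not the names used in this PR, and the sketch assumes a compiler that provides `unsigned __int128` (GCC and Clang do). After precomputing $c = \lceil 2^{64} / d \rceil$, the remainder is the high 64 bits of a single 64 x 64-bit product, with no conditional correction step.

```c
#include <assert.h>
#include <stdint.h>

/* Precompute c = ceil(2^64 / d) for a nonzero 32-bit divisor d.
   For d = 1 this wraps around to 0, which still gives remainder 0. */
static uint64_t rem32_precomp(uint32_t d)
{
    assert(d != 0);
    return UINT64_MAX / d + 1;
}

/* Return a mod d, given c = rem32_precomp(d). */
static uint32_t rem32(uint32_t a, uint64_t c, uint32_t d)
{
    /* Low 64 bits of c * a: the fractional part of a / d, scaled by 2^64. */
    uint64_t lowbits = c * a;

    /* Multiplying the fraction by d and keeping the high word
       recovers the remainder. */
    return (uint32_t) (((unsigned __int128) lowbits * d) >> 64);
}
```

In a loop that reduces many values by the same modulus, `rem32_precomp` would be called once outside the loop and `rem32` once per value.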