Skip to content

Conversation

@vneiger
Copy link
Collaborator

@vneiger vneiger commented Nov 2, 2025

This adds functions for computing the quotient and remainder in the division by $x^n - c$. Special cases $c = -1$, $c = 0$, $c = 1$ are handled more efficiently with specific code, and for general $c$ this uses precomputations on $c$ if the modulus allows it.

Tests, documentation, profiling files updated.

Excerpt of profile output on zen4 and meteor lake (full output in file below) :

profiled: 1. divrem 2. new, poly interface 3. new, vec general 4. new, precomp 5. new, precomp_lazy
note: variant 2. expected to be faster than variants 3.4.5. when c == 1 or c == -1

                      zen4 AMD Ryzen 7 PRO 7840U 5.1GHz        meteor lake Intel Core Ultra 7 165H 5.0GHz
bit c len    n      1.      2.      3.      4.      5.           1.      2.      3.      4.      5.

## c == 1 | modulus x**n - 1
62  0 5      1      2.1e+02 1.5e+02 1.1e+02 8.7e+01 4.7e+01      1.9e+02 1.6e+02 1.3e+02 1.1e+02 4.2e+01
62  0 40     5      2.6e+03 1.7e+02 6.5e+02 5.1e+02 1.1e+02      2.7e+03 1.5e+02 6.1e+02 4.9e+02 8.2e+01
62  0 160    25     1.0e+04 3.2e+02 2.7e+03 1.9e+03 3.3e+02      9.1e+03 2.7e+02 2.4e+03 2.0e+03 2.7e+02
62  0 2560   25     1.8e+05 3.7e+03 4.9e+04 3.4e+04 5.5e+03      1.5e+05 3.6e+03 4.4e+04 3.5e+04 5.0e+03
62  0 2560   625    7.3e+05 2.4e+03 3.6e+04 2.5e+04 4.0e+03      7.3e+05 3.0e+03 3.3e+04 2.6e+04 3.9e+03
62  0 20480  25     1.5e+06 2.9e+04 3.9e+05 2.8e+05 4.4e+04      1.2e+06 2.8e+04 3.6e+05 2.9e+05 4.1e+04
62  0 20480  3125   6.7e+06 2.2e+04 3.3e+05 2.3e+05 3.6e+04      6.9e+06 2.9e+04 3.0e+05 2.4e+05 3.4e+04
62  0 327680 25     2.3e+07 4.9e+05 6.3e+06 4.5e+06 7.1e+05      2.1e+07 6.2e+05 5.8e+06 4.6e+06 6.7e+05
62  0 327680 3125   1.1e+08 3.9e+05 6.1e+06 4.3e+06 6.7e+05      1.2e+08 6.4e+05 5.6e+06 4.4e+06 6.3e+05

## c general | modulus x**n - c
62  2 5      1      2.1e+02 1.2e+02 1.2e+02 9.3e+01 4.7e+01      1.9e+02 1.2e+02 1.4e+02 1.2e+02 4.1e+01
62  2 40     5      2.6e+03 2.0e+02 7.3e+02 5.5e+02 1.1e+02      2.6e+03 1.9e+02 6.8e+02 5.6e+02 8.1e+01
62  2 160    25     1.1e+04 5.6e+02 2.7e+03 1.9e+03 3.2e+02      9.2e+03 5.8e+02 2.5e+03 2.0e+03 2.7e+02
62  2 2560   25     1.8e+05 6.9e+03 4.9e+04 3.4e+04 5.5e+03      1.5e+05 7.1e+03 4.4e+04 3.6e+04 5.0e+03
62  2 2560   625    7.2e+05 5.6e+03 3.7e+04 2.5e+04 4.0e+03      7.4e+05 6.1e+03 3.3e+04 2.6e+04 3.8e+03
62  2 20480  25     1.5e+06 5.3e+04 4.0e+05 2.8e+05 4.4e+04      1.2e+06 5.5e+04 3.6e+05 2.9e+05 3.9e+04
62  2 20480  3125   6.6e+06 4.7e+04 3.3e+05 2.3e+05 3.6e+04      6.7e+06 5.3e+04 2.9e+05 2.3e+05 3.3e+04
62  2 327680 25     2.3e+07 8.7e+05 6.4e+06 4.5e+06 7.2e+05      2.0e+07 9.4e+05 5.8e+06 4.6e+06 6.7e+05
62  2 327680 3125   1.1e+08 8.4e+05 6.2e+06 4.3e+06 6.8e+05      1.2e+08 9.1e+05 5.6e+06 4.4e+06 6.4e+05

special_rem_profile.txt

}

int lazy = 0;
nn_ptr RQ = _nmod_vec_init(len);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A temporary allocation would be good to try here. @fredrik-johansson how do we deal with stack allocation for these types?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like what is at lines 89--94 of nmod_poly/rem.c ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe so!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will give this a try!

@fredrik-johansson
Copy link
Collaborator

A bit unusual for this code is that it does a lot of logic in the non-underscore method.

Does this functionality actually really make sense as a standalone method, or could it be a special case of a method for dividing by a sparse polynomial after a precomputation? This is currently used in the fq_nmod context and reimplemented as a hack (with the precomputation done inline) in _nmod_poly_divrem_newton_n_preinv. One could very easily put in a branch to check for monic binomials, among with other special cases like monic trinomials. I thought about putting in Shoup precomputations and lazy reduction there too but didn't get around to it. It would be nice to do this systematically for other sparse polynomials. To clarify, I have no objection to the standalone function proposed in this PR, just asking if it doesn't make sense to generalise it right away.

@vneiger
Copy link
Collaborator Author

vneiger commented Nov 3, 2025

A bit unusual for this code is that it does a lot of logic in the non-underscore method.

Does this functionality actually really make sense as a standalone method, or could it be a special case of a method for dividing by a sparse polynomial after a precomputation? This is currently used in the fq_nmod context and reimplemented as a hack (with the precomputation done inline) in _nmod_poly_divrem_newton_n_preinv. One could very easily put in a branch to check for monic binomials, among with other special cases like monic trinomials. I thought about putting in Shoup precomputations and lazy reduction there too but didn't get around to it. It would be nice to do this systematically for other sparse polynomials. To clarify, I have no objection to the standalone function proposed in this PR, just asking if it doesn't make sense to generalise it right away.

This question --- fast remainder on division by a binomial $x^n - c$ --- arose while working on the draft PR on n_fft. There I only need the underscore lazy variant, and maybe the $x^n - 1$ variant. Also, no need for the quotient there, but there the remainder is computed in-place (erasing the input polynomial) in which case the quotient appears during the computation anyway.

So my main wish is to have good performance of the underscore lazy variant (which probably means keeping this one specific to binomials). For the rest, I thought this was a fairly classical operation to work on univariate polynomials modulo $x^n - 1$ or $x^n + 1$, even when the modulus is not prime, but I actually have no use in mind for the general non-underscore variant. It could indeed be modified/discarded to provide this via a more general sparse division (I will have a look at the pointers you give).

@vneiger
Copy link
Collaborator Author

vneiger commented Nov 3, 2025

A bit unusual for this code is that it does a lot of logic in the non-underscore method.

Does this functionality actually really make sense as a standalone method, or could it be a special case of a method for dividing by a sparse polynomial after a precomputation? This is currently used in the fq_nmod context and reimplemented as a hack (with the precomputation done inline) in _nmod_poly_divrem_newton_n_preinv. One could very easily put in a branch to check for monic binomials, among with other special cases like monic trinomials. I thought about putting in Shoup precomputations and lazy reduction there too but didn't get around to it. It would be nice to do this systematically for other sparse polynomials. To clarify, I have no objection to the standalone function proposed in this PR, just asking if it doesn't make sense to generalise it right away.

I've just had a look at _nmod_poly_divrem_try_sparse . As you wrote, there is clear overlap in this PR with what is there, so this should be worked on before any kind of merge. I will put this PR in draft for now and handle this later.

Interestingly, for monic binomials, the code in try_sparse is slightly faster than the code in this PR when Shoup precomputations are not possible. I haven't checked why, but this is not due to temp allocations alone. Possibly the mix of while and for loop in the variant I wrote prevents the compiler from unrolling loops nicely. The roadmap for this PR could then become something like this, to get the best of both pieces of code:

  • make sure performance (without precomputations) is at least as good as try_sparse,
  • allow sparse monic polynomials with more than two monomials (like try_sparse),
  • incorporate some Shoup precomputations / laziness when the modulus permits,
  • expose this as some divrem_sparse and rem_sparse function?
  • (... and don't forget to write documentation)

By the way, in case this rings any bell of other places where this may be useful/existing in flint: for n_fft things I will also need the "transposed" operation of remainder mod $x^n - c$, which consists in extending a linearly recurrent sequence with recurrence $x^n - c$ from its first $n$ terms $u_0,\ldots,u_{n-1}$ (in this binomial case, this is just multiplying these terms by suitable powers of $c$).

@vneiger vneiger marked this pull request as draft November 3, 2025 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants