Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use VPERMB _mm512_permutexvar_epi8 for DES and/or Lotus #5706

Open
solardiz opened this issue Mar 22, 2025 · 3 comments
Open

Use VPERMB _mm512_permutexvar_epi8 for DES and/or Lotus #5706

solardiz opened this issue Mar 22, 2025 · 3 comments
Labels
enhancement RFC / discussion Help or comments wanted

Comments

@solardiz
Copy link
Member

This is another recent instruction introduced in Intel Cannon Lake (9th gen, but not all) and above (consistently since Ice Lake, 10th gen) through the VBMI extension on top of AVX-512. It appears to perform a mapping that's just right for one DES S-box, 64 times in parallel. So a non-bitslice DES implementation using this instruction may outperform bitslice.

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_permutexvar_epi8&ig_expand=5071

__m512i _mm512_permutexvar_epi8 (__m512i idx, __m512i a)
#include <immintrin.h>
Instruction: vpermb zmm, zmm, zmm
CPUID Flags: AVX512_VBMI
Description
Shuffle 8-bit integers in a across lanes using the corresponding index in idx, and store the results in dst.
Operation
FOR j := 0 to 63
	i := j*8
	id := idx[i+5:i]*8
	dst[i+7:i] := a[id+7:id]
ENDFOR
dst[MAX:512] := 0
@solardiz
Copy link
Member Author

Looks like this can provide great speedup for S-box lookups over bitslice: 22.125 VPTERNLOG's per 512 lookups vs. 1 VPERMB per 64, so kind of a ~2.76x speedup for this step. And either of these is times 8 for the 8 S-boxes. However, then things become difficult with DES P table. For input to VPERMB, our DES blocks need to be in 8 vectors (one byte from each computation per vector). However, each 4-bit output of S-box lookup then goes into P, which we'd need to implement by separating the 4 bits into up to 4 different vectors, so that they're in the right place for the next round. This will quite likely remove the whole advantage, because it's then something like 5 instructions per S-box lookup (VPERMB + 4 more), which is already twice worse than we have with bitslice. And then there's E, which is relatively easy without salts (LM hash, etc.) because the table contains adjacent values, but with salts it would also be a permutation involving many of the 8 vectors.

P and E weren't causing much or any slowdown in scalar optimized implementations because S-box outputs could be readily represented in the table not as 4-bit values, but as the 4 bits spread out across 32- or 64-bit machine words. So we'd have combined SP or SPE tables applying these transformations in one go. With VPERMB, the output of each lookup is 8-bit max, and the space in vectors we're preparing for next round is only 8 bits per S-box per computation, so we're very limited in how we may apply this technique. In most cases, P output positions are too far apart for even a pair of bits to go into the same output vector.

So this is probably a no-go primarily because of P. But it would be fun for someone to try this out and show actual results.

@solardiz solardiz added the RFC / discussion Help or comments wanted label Mar 23, 2025
@solardiz solardiz changed the title Use VPERMB _mm512_permutexvar_epi8 for DES Use VPERMB _mm512_permutexvar_epi8 for Lotus Mar 23, 2025
@solardiz
Copy link
Member Author

Looks like this won't help for DES, but it might for Lotus where we need the 8-bit outputs and don't need to expand them further, and where the S-box expressions are much longer (although we also need to implement #5451 at least for systems without such instruction). The 6-bit inputs will be rather limiting - we'd need 4 VPERMB per S-box lookup followed by logic to select the right outputs by high 2 bits, with the S-box content spread across 4 vectors. This extra logic may be a performance killer. The break-even point appears to be at 178/8 = ~22 instructions per S-box lookup, so the question is whether we can do it in fewer than that (probably yes).

@solardiz solardiz changed the title Use VPERMB _mm512_permutexvar_epi8 for Lotus Use VPERMB _mm512_permutexvar_epi8 for DES and/or Lotus Mar 23, 2025
@solardiz
Copy link
Member Author

each 4-bit output of S-box lookup then goes into P, which we'd need to implement by separating the 4 bits into up to 4 different vectors, so that they're in the right place for the next round. This will quite likely remove the whole advantage, because it's then something like 5 instructions per S-box lookup (VPERMB + 4 more)

I was wrongly thinking of applying P and E right after each S-box lookup like we do in the bitslice implementation, but with a more traditional implementation we'd be applying them after all 8 S-box lookups, so it probably won't be so many instructions per lookup. This may be reasonable, after all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement RFC / discussion Help or comments wanted
Projects
None yet
Development

No branches or pull requests

1 participant