Use VPERMB _mm512_permutexvar_epi8 for DES and/or Lotus #5706

solardiz · 2025-03-22T23:37:13Z

This is another recent instruction introduced in Intel Cannon Lake (9th gen, but not all) and above (consistently since Ice Lake, 10th gen) through the VBMI extension on top of AVX-512. It appears to perform a mapping that's just right for one DES S-box, 64 times in parallel. So a non-bitslice DES implementation using this instruction may outperform bitslice.

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_permutexvar_epi8&ig_expand=5071

__m512i _mm512_permutexvar_epi8 (__m512i idx, __m512i a)
#include <immintrin.h>
Instruction: vpermb zmm, zmm, zmm
CPUID Flags: AVX512_VBMI
Description
Shuffle 8-bit integers in a across lanes using the corresponding index in idx, and store the results in dst.
Operation
FOR j := 0 to 63
	i := j*8
	id := idx[i+5:i]*8
	dst[i+7:i] := a[id+7:id]
ENDFOR
dst[MAX:512] := 0

The text was updated successfully, but these errors were encountered:

solardiz · 2025-03-23T00:14:06Z

Looks like this can provide great speedup for S-box lookups over bitslice: 22.125 VPTERNLOG's per 512 lookups vs. 1 VPERMB per 64, so kind of a ~2.76x speedup for this step. And either of these is times 8 for the 8 S-boxes. However, then things become difficult with DES P table. For input to VPERMB, our DES blocks need to be in 8 vectors (one byte from each computation per vector). However, each 4-bit output of S-box lookup then goes into P, which we'd need to implement by separating the 4 bits into up to 4 different vectors, so that they're in the right place for the next round. This will quite likely remove the whole advantage, because it's then something like 5 instructions per S-box lookup (VPERMB + 4 more), which is already twice worse than we have with bitslice. And then there's E, which is relatively easy without salts (LM hash, etc.) because the table contains adjacent values, but with salts it would also be a permutation involving many of the 8 vectors.

P and E weren't causing much or any slowdown in scalar optimized implementations because S-box outputs could be readily represented in the table not as 4-bit values, but as the 4 bits spread out across 32- or 64-bit machine words. So we'd have combined SP or SPE tables applying these transformations in one go. With VPERMB, the output of each lookup is 8-bit max, and the space in vectors we're preparing for next round is only 8 bits per S-box per computation, so we're very limited in how we may apply this technique. In most cases, P output positions are too far apart for even a pair of bits to go into the same output vector.

So this is probably a no-go primarily because of P. But it would be fun for someone to try this out and show actual results.

solardiz · 2025-03-23T02:19:32Z

Looks like this won't help for DES, but it might for Lotus where we need the 8-bit outputs and don't need to expand them further, and where the S-box expressions are much longer (although we also need to implement #5451 at least for systems without such instruction). The 6-bit inputs will be rather limiting - we'd need 4 VPERMB per S-box lookup followed by logic to select the right outputs by high 2 bits, with the S-box content spread across 4 vectors. This extra logic may be a performance killer. The break-even point appears to be at 178/8 = ~22 instructions per S-box lookup, so the question is whether we can do it in fewer than that (probably yes).

solardiz · 2025-03-23T18:59:24Z

each 4-bit output of S-box lookup then goes into P, which we'd need to implement by separating the 4 bits into up to 4 different vectors, so that they're in the right place for the next round. This will quite likely remove the whole advantage, because it's then something like 5 instructions per S-box lookup (VPERMB + 4 more)

I was wrongly thinking of applying P and E right after each S-box lookup like we do in the bitslice implementation, but with a more traditional implementation we'd be applying them after all 8 S-box lookups, so it probably won't be so many instructions per lookup. This may be reasonable, after all.

solardiz added the enhancement label Mar 22, 2025

solardiz added the RFC / discussion Help or comments wanted label Mar 23, 2025

solardiz changed the title ~~Use VPERMB _mm512_permutexvar_epi8 for DES~~ Use VPERMB _mm512_permutexvar_epi8 for Lotus Mar 23, 2025

solardiz changed the title ~~Use VPERMB _mm512_permutexvar_epi8 for Lotus~~ Use VPERMB _mm512_permutexvar_epi8 for DES and/or Lotus Mar 23, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use VPERMB _mm512_permutexvar_epi8 for DES and/or Lotus #5706

Use VPERMB _mm512_permutexvar_epi8 for DES and/or Lotus #5706

solardiz commented Mar 22, 2025

solardiz commented Mar 23, 2025

solardiz commented Mar 23, 2025

solardiz commented Mar 23, 2025

Use VPERMB _mm512_permutexvar_epi8 for DES and/or Lotus #5706

Use VPERMB _mm512_permutexvar_epi8 for DES and/or Lotus #5706

Comments

solardiz commented Mar 22, 2025

solardiz commented Mar 23, 2025

solardiz commented Mar 23, 2025

solardiz commented Mar 23, 2025