Skip to content

Latest commit

 

History

History
404 lines (350 loc) · 16.9 KB

x64.md

File metadata and controls

404 lines (350 loc) · 16.9 KB

x86-64 CPU benchmark results

AMD Ryzen7 9700X

Microarchitecture: Zen5

Setting: 8 Zen5 Cores

For single core:

$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 1.4139 TOPS      |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 706.78 GOPS      |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 707.45 GOPS      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 353.69 GOPS      |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 706.71 GFLOPS    |
| AVX512F         | FMA(f32,f32,f32)      | 353.3 GFLOPS     |
| AVX512F         | FMA(f64,f64,f64)      | 176.62 GFLOPS    |
| FMA             | FMA(f32,f32,f32)      | 176.83 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 88.417 GFLOPS    |
| AVX             | ADD(MUL(f32,f32),f32) | 176.48 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 88.084 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 88.286 GFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 44.058 GFLOPS    |
--------------------------------------------------------------

For 8 cores:

$ ./cpufp --thread_pool=[0-7]
Number Threads: 8
Thread Pool Binding: 0 1 2 3 4 5 6 7
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 10.859 TOPS      |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 5.4159 TOPS      |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 5.4499 TOPS      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 2.7227 TOPS      |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 5.4168 TFLOPS    |
| AVX512F         | FMA(f32,f32,f32)      | 2.7021 TFLOPS    |
| AVX512F         | FMA(f64,f64,f64)      | 1.3483 TFLOPS    |
| FMA             | FMA(f32,f32,f32)      | 1.3567 TFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 678.31 GFLOPS    |
| AVX             | ADD(MUL(f32,f32),f32) | 1.2726 TFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 633.21 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 600.08 GFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 333.63 GFLOPS    |
--------------------------------------------------------------

Intel Xeon w9-3495X

Microarchitecture: Sapphire Rapids

Setting: 1 Sockets x 56 Golden Cove Cores

For single core:

$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AMX_INT8        | MM(s32,s8,s8)         | 5.6821 TOPS      |
| AMX_INT8        | MM(s32,s8,u8)         | 5.6854 TOPS      |
| AMX_INT8        | MM(s32,u8,s8)         | 5.6872 TOPS      |
| AMX_INT8        | MM(s32,u8,u8)         | 5.6905 TOPS      |
| AMX_BF16        | MM(f32,bf16,bf16)     | 2.8448 TFLOPS    |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 711.46 GOPS      |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 355.73 GOPS      |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 368.94 GOPS      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 184.44 GOPS      |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 80.477 GFLOPS    |
| AVX512_FP16     | FMA(f16,f16,f16)      | 355.76 GFLOPS    |
| AVX512F         | FMA(f32,f32,f32)      | 158.74 GFLOPS    |
| AVX512F         | FMA(f64,f64,f64)      | 79.375 GFLOPS    |
| FMA             | FMA(f32,f32,f32)      | 92.224 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 46.115 GFLOPS    |
| AVX             | ADD(MUL(f32,f32),f32) | 67.789 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 33.9 GFLOPS      |
| SSE             | ADD(MUL(f32,f32),f32) | 34.43 GFLOPS     |
| SSE2            | ADD(MUL(f64,f64),f64) | 17.218 GFLOPS    |
--------------------------------------------------------------

For 56 cores:

$ ./cpufp --thread_pool=[0-55]
Number Threads: 56
Thread Pool Binding: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AMX_INT8        | MM(s32,s8,s8)         | 293.86 TOPS      |
| AMX_INT8        | MM(s32,s8,u8)         | 309.81 TOPS      |
| AMX_INT8        | MM(s32,u8,s8)         | 293.44 TOPS      |
| AMX_INT8        | MM(s32,u8,u8)         | 293.07 TOPS      |
| AMX_BF16        | MM(f32,bf16,bf16)     | 141.12 TFLOPS    |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 39.629 TOPS      |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 19.772 TOPS      |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 20.503 TOPS      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 10.236 TOPS      |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 4.4223 TFLOPS    |
| AVX512_FP16     | FMA(f16,f16,f16)      | 19.761 TFLOPS    |
| AVX512F         | FMA(f32,f32,f32)      | 7.7876 TFLOPS    |
| AVX512F         | FMA(f64,f64,f64)      | 3.8961 TFLOPS    |
| FMA             | FMA(f32,f32,f32)      | 4.962 TFLOPS     |
| FMA             | FMA(f64,f64,f64)      | 2.4778 TFLOPS    |
| AVX             | ADD(MUL(f32,f32),f32) | 3.4637 TFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 1.7112 TFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 1.9122 TFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 960.12 GFLOPS    |
--------------------------------------------------------------

Intel Xeon Gold 6455B

Microarchitecture: Sapphire Rapids

Setting: 2 Sockets x 32 Golden Cove Cores

For single core:

$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AMX_INT8        | MM(s32,s8,s8)         | 6.3726 Tops      |
| AMX_INT8        | MM(s32,s8,u8)         | 7.5746 Tops      |
| AMX_INT8        | MM(s32,u8,s8)         | 7.5733 Tops      |
| AMX_INT8        | MM(s32,u8,u8)         | 7.5718 Tops      |
| AMX_BF16        | MM(f32,bf16,bf16)     | 3.7868 Tflops    |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 998.07 Gops      |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 499.07 Gops      |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 498.96 Gops      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 249.47 Gops      |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 115.16 Gflops    |
| AVX512_FP16     | FMA(f16,f16,f16)      | 499.08 Gflops    |
| AVX512F         | FMA(f32,f32,f32)      | 230.28 Gflops    |
| AVX512F         | FMA(f64,f64,f64)      | 115.17 Gflops    |
| FMA             | FMA(f32,f32,f32)      | 118.35 Gflops    |
| FMA             | FMA(f64,f64,f64)      | 62.385 Gflops    |
| AVX             | ADD(MUL(f32,f32),f32) | 91.59 Gflops     |
| AVX             | ADD(MUL(f64,f64),f64) | 45.85 Gflops     |
| SSE             | ADD(MUL(f32,f32),f32) | 46.493 Gflops    |
| SSE2            | ADD(MUL(f64,f64),f64) | 23.235 Gflops    |
--------------------------------------------------------------

For 64 cores:

$ ./cpufp --thread_pool=[0-63]
Number Threads: 64
Thread Pool Binding: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AMX_INT8        | MM(s32,s8,s8)         | 390.67 Tops      |
| AMX_INT8        | MM(s32,s8,u8)         | 380.93 Tops      |
| AMX_INT8        | MM(s32,u8,s8)         | 391.32 Tops      |
| AMX_INT8        | MM(s32,u8,u8)         | 380.28 Tops      |
| AMX_BF16        | MM(f32,bf16,bf16)     | 192.47 Tflops    |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 48.114 Tops      |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 24.169 Tops      |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 30.818 Tops      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 15.74 Tops       |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 7.09 Tflops      |
| AVX512_FP16     | FMA(f16,f16,f16)      | 31.473 Tflops    |
| AVX512F         | FMA(f32,f32,f32)      | 14.329 Tflops    |
| AVX512F         | FMA(f64,f64,f64)      | 6.5406 Tflops    |
| FMA             | FMA(f32,f32,f32)      | 7.4039 Tflops    |
| FMA             | FMA(f64,f64,f64)      | 3.9067 Tflops    |
| AVX             | ADD(MUL(f32,f32),f32) | 5.4087 Tflops    |
| AVX             | ADD(MUL(f64,f64),f64) | 2.7339 Tflops    |
| SSE             | ADD(MUL(f32,f32),f32) | 2.9077 Tflops    |
| SSE2            | ADD(MUL(f64,f64),f64) | 1.4791 Tflops    |
--------------------------------------------------------------

AMD Ryzen7 8845HS

Architecture: Zen4

Setting: 8 Zen4 Cores

For single core:

$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 647.97 GOPS      |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 324.27 GOPS      |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 324.92 GFLOPS    |
| AVX512F         | FMA(f32,f32,f32)      | 163.58 GFLOPS    |
| AVX512F         | FMA(f64,f64,f64)      | 81.786 GFLOPS    |
| FMA             | FMA(f32,f32,f32)      | 163.57 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 81.785 GFLOPS    |
| AVX             | ADD(MUL(f32,f32),f32) | 157.36 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 79.045 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 80.34 GFLOPS     |
| SSE2            | ADD(MUL(f64,f64),f64) | 40.371 GFLOPS    |
--------------------------------------------------------------

For 8 cores:

$ ./cpufp --thread_pool=[0-7]
Number Threads: 8
Thread Pool Binding: 0 1 2 3 4 5 6 7
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 5113.8 GOPS      |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 2559.1 GOPS      |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 2551.6 GFLOPS    |
| AVX512F         | FMA(f32,f32,f32)      | 1283.6 GFLOPS    |
| AVX512F         | FMA(f64,f64,f64)      | 641.21 GFLOPS    |
| FMA             | FMA(f32,f32,f32)      | 1271.7 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 632.3 GFLOPS     |
| AVX             | ADD(MUL(f32,f32),f32) | 1193.6 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 590.85 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 613.54 GFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 307.67 GFLOPS    |
--------------------------------------------------------------

AMD Ryzen9 6900HX

Architecture: Zen3+

Setting: 8 Zen3+ Cores

For single core:

$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| FMA             | FMA(f32,f32,f32)      | 151.84 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 75.702 GFLOPS    |
| AVX             | ADD(MUL(f32,f32),f32) | 150.86 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 75.476 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 75.452 GFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 37.737 GFLOPS    |
--------------------------------------------------------------

For 8 cores:

$ ./cpufp --thread_pool=[0,2,4,6,8,10,12,14]
Number Threads: 8
Thread Pool Binding: 0 2 4 6 8 10 12 14
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| FMA             | FMA(f32,f32,f32)      | 1057.8 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 534.37 GFLOPS    |
| AVX             | ADD(MUL(f32,f32),f32) | 1037.6 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 516.21 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 518.32 GFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 258.92 GFLOPS    |
--------------------------------------------------------------

Intel Core i5-1340P

Product Code Name: Raptor Lake

Setting: 4 Raptor Cove(P-Core) Cores + 8 Gracemont(E-Core) Cores

For single P-Core:

$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 586.84 Gops      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 293.5 Gops       |
| FMA             | FMA(f32,f32,f32)      | 146.76 Gflops    |
| FMA             | FMA(f64,f64,f64)      | 73.373 Gflops    |
| AVX             | ADD(MUL(f32,f32),f32) | 107.7 Gflops     |
| AVX             | ADD(MUL(f64,f64),f64) | 53.512 Gflops    |
| SSE             | ADD(MUL(f32,f32),f32) | 54.49 Gflops     |
| SSE2            | ADD(MUL(f64,f64),f64) | 27.243 Gflops    |
--------------------------------------------------------------

For 4 P-Cores:

$ ./cpufp --thread_pool=[0,2,4,6]
Number Threads: 4
Thread Pool Binding: 0 2 4 6
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 2.2454 Tops      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 1.1215 Tops      |
| FMA             | FMA(f32,f32,f32)      | 546.31 Gflops    |
| FMA             | FMA(f64,f64,f64)      | 267.62 Gflops    |
| AVX             | ADD(MUL(f32,f32),f32) | 356.72 Gflops    |
| AVX             | ADD(MUL(f64,f64),f64) | 176.89 Gflops    |
| SSE             | ADD(MUL(f32,f32),f32) | 183.39 Gflops    |
| SSE2            | ADD(MUL(f64,f64),f64) | 91.293 Gflops    |
--------------------------------------------------------------

For single E-Core:

$ ./cpufp --thread_pool=[8]
Number Threads: 1
Thread Pool Binding: 8
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 108.5 Gops       |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 54.251 Gops      |
| FMA             | FMA(f32,f32,f32)      | 54.248 Gflops    |
| FMA             | FMA(f64,f64,f64)      | 27.125 Gflops    |
| AVX             | ADD(MUL(f32,f32),f32) | 27.126 Gflops    |
| AVX             | ADD(MUL(f64,f64),f64) | 13.563 Gflops    |
| SSE             | ADD(MUL(f32,f32),f32) | 27.122 Gflops    |
| SSE2            | ADD(MUL(f64,f64),f64) | 13.561 Gflops    |
--------------------------------------------------------------

For 8 E-Cores:

$ ./cpufp --thread_pool=[8-15]
Number Threads: 8
Thread Pool Binding: 8 9 10 11 12 13 14 15
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 791.36 Gops      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 395.68 Gops      |
| FMA             | FMA(f32,f32,f32)      | 395.67 Gflops    |
| FMA             | FMA(f64,f64,f64)      | 197.83 Gflops    |
| AVX             | ADD(MUL(f32,f32),f32) | 197.84 Gflops    |
| AVX             | ADD(MUL(f64,f64),f64) | 98.921 Gflops    |
| SSE             | ADD(MUL(f32,f32),f32) | 197.83 Gflops    |
| SSE2            | ADD(MUL(f64,f64),f64) | 98.916 Gflops    |
--------------------------------------------------------------

Intel N100

Product Code Name: Alder Lake-N

Setting: 4 Gracemont Cores

For single core:

$ ./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 108.51 GOPS      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 54.244 GOPS      |
| FMA             | FMA(f32,f32,f32)      | 54.247 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 27.128 GFLOPS    |
| AVX             | ADD(MUL(f32,f32),f32) | 27.128 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 13.564 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 27.126 GFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 13.563 GFLOPS    |
--------------------------------------------------------------

For 4 cores:

$ ./cpufp --thread_pool=[0-3]
Number Threads: 4
Thread Pool Binding: 0 1 2 3
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX_VNNI        | DP4A(s32,u8,s8)       | 369.66 GOPS      |
| AVX_VNNI        | DP2A(s32,s16,s16)     | 185.09 GOPS      |
| FMA             | FMA(f32,f32,f32)      | 185.08 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 92.55 GFLOPS     |
| AVX             | ADD(MUL(f32,f32),f32) | 92.546 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 46.269 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 92.546 GFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 46.27 GFLOPS     |
--------------------------------------------------------------