# Simple vectorization and instruction level parallelism

In this exercise you can investigate with a simple microbenchmark how vectorization and
instruction level parallelism show up in the performance. We suggest trying it on both Mahti
and Puhti.

[peak](../demos/peak) is a microbenchmark that issues independent operations (fused multiply
adds, multiplications, or additions). The operands can be kept in registers, so the code is
purely compute bound and can reach close to the theoretical peak floating point performance.
The code uses C++ template metaprogramming, but in pseudocode the main kernel looks like this
(for fused multiply add, multiply, and add, respectively):
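
The pseudocode itself is not reproduced here; as a rough, hypothetical stand-in, plain C++
loops with the same structure could look as follows (the real benchmark generates the unrolled
operations with template metaprogramming, and the names `NREG` and `niters` are assumptions):

```cpp
// Hypothetical sketch only, NOT the actual kernel in ../demos/peak.
// NREG independent accumulators can all be kept in registers, so the loop is
// limited only by the throughput and latency of the arithmetic units.
#include <cstdio>

constexpr int NREG = 8;   // assumed number of independent operand sets

int main() {
    double a[NREG], b[NREG], c[NREG];
    for (int i = 0; i < NREG; ++i) { a[i] = 1.0 + i; b[i] = 2.0; c[i] = 0.0; }

    const long niters = 100000000;
    for (long it = 0; it < niters; ++it) {
        for (int i = 0; i < NREG; ++i) {
            c[i] = a[i] * b[i] + c[i];   // fused multiply add variant
            // c[i] = a[i] * c[i];       // multiply variant
            // c[i] = a[i] + c[i];       // add variant
        }
    }

    // Print a result so the compiler cannot optimize the work away.
    double sum = 0.0;
    for (int i = 0; i < NREG; ++i) sum += c[i];
    std::printf("%g\n", sum);
    return 0;
}
```

Because the `NREG` updates are independent of each other, the CPU can keep several of them in
flight at once, which is what exposes the instruction level parallelism.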

... by default or at lower optimization levels, and we encourage you to also investigate the
Intel and Clang compilers.

With the default options in [Makefile](../demos/peak/Makefile), GCC does not vectorize
the code. Here, you can investigate how enabling vectorization affects performance.

- Compile and run the code first (with a single thread / core) using the default settings.
- Then enable vectorization by adding the `-fopenmp-simd` option. How does the
  performance change?
- Most current CPUs have vector units that can perform the fused multiply add
  operation in one instruction. However, that changes the semantics of floating point
  arithmetic a bit, and thus GCC does not use FMA instructions by default. Enable
  FMA instructions by adding the `-mfma` option. How does the performance change?
- On Puhti, try to use AVX512 instead of the default AVX2 (change `VECTOR_WIDTH` to 8
  and compile with `-mavx512f -mprefer-vector-width=512`).
- Try to get a compiler optimization report (e.g. by adding `-fopt-info-vec`) both
  with vectorization enabled and disabled.
- Try to look also into the assembly code both with and without vectorization
  (you can produce a `peak_xxx.s` file with `-S -fverbose-asm`). You can look
  e.g. for `xmm` (SSE), `ymm` (AVX2), and `zmm` (AVX512) registers; a small
  standalone loop for experimenting with the same flags is sketched after this list.
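
As a concrete, self-contained illustration of the same flags (this file and its name `toy.cpp`
are made up and not part of the peak benchmark), the `#pragma omp simd` directive below takes
effect only when SIMD directives are enabled with `-fopenmp-simd` (or `-fopenmp`):

```cpp
// toy.cpp -- a hypothetical standalone loop for experimenting with the flags
// used in the exercise (not part of ../demos/peak).
//
// Example invocations (assuming GCC on an x86-64 node):
//   g++ -O2 -S -fverbose-asm toy.cpp                       # baseline; may stay scalar
//   g++ -O2 -fopenmp-simd -S -fverbose-asm toy.cpp         # vectorized (look for xmm/ymm)
//   g++ -O2 -fopenmp-simd -mfma -S -fverbose-asm toy.cpp   # FMA allowed (look for vfmadd)
//   g++ -O2 -fopenmp-simd -mavx512f -mprefer-vector-width=512 -S -fverbose-asm toy.cpp  # zmm
#include <cstddef>

void axpy(double a, const double *x, double *y, std::size_t n)
{
    // The pragma asks the compiler to vectorize this loop; it takes effect
    // only when SIMD directives are enabled with -fopenmp-simd or -fopenmp.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];   // may become a single vfmadd when -mfma is given
    }
}
```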

As GCC does not inform about the use of FMA instructions in the optimization reports,
looking into the assembly is often the only way to find out whether they are used.
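
For example, a tiny test file (the name `fma_check.cpp` is made up) compiled with
`g++ -O2 -mfma -S fma_check.cpp` typically yields `vfmadd` instructions in `fma_check.s`,
whereas without `-mfma` the same expression compiles to separate multiply and add instructions:

```cpp
// fma_check.cpp -- a hypothetical minimal test for spotting FMA in the assembly.
// Compile with:  g++ -O2 -mfma -S fma_check.cpp
// and then search fma_check.s for instructions starting with "vfmadd".
double fused(double a, double b, double c)
{
    return a * b + c;   // candidate for contraction into a single FMA instruction
}
```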