Commit 18a6789

Small text corrections
1 parent 8bb3a6d commit 18a6789

1 file changed

+10
-7
lines changed

instruction-level-parallelism/README.md

Lines changed: 10 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,8 @@
11
# Simple vectorization and instruction level parallelism
22

3+
In this exercise you can investigate with a simple microbenchmark how vectorization and instruction
4+
level parallelism show up in the performance. We suggest that you try this on both Mahti and Puhti.
5+
36
[peak](../demos/peak) is a microbenchmark, where one issues independent operations (fused multiply adds, multiplications, or additions). Operands can be kept in registers, so the
47
code is purely compute bound and can reach close to the theoretical peak
58
floating point performance. The code uses C++ template metaprogramming, but in pseudocode the main kernel looks like (for fused multiply add, multiply and add correspondingly):
@@ -23,21 +26,21 @@ by default or at lower optimization levels, and we encourage you to investigate
2326
also Intel and Clang compilers.
2427

2528
With the default options in [Makefile](../demos/peak/Makefile), GCC does not vectorize
26-
the code. Here, you can investigate how enabling vectorization affects performance.
29+
the code. Here, you can investigate how enabling vectorization affects performance.
2730

2831
- Compile and run the code first (with a single thread / core) with the default settings.
2932
- Then enable vectorization by adding the `-fopenmp-simd` option. How does the
3033
performance change?
3134
- Most of the current CPUs have vector units that can perform the fused multiply add
32-
operation in one instruction. However, as that changes the semantics of floating point
35+
operation in one instruction. However, that changes the semantics of floating point
3336
arithmetic a bit, and thus GCC does not use FMA instructions by default. Enable
3437
FMA instructions by adding `-mfma` option. How does the performance change?
3538
- In Puhti, try to use AVX512 instead of the default AVX2 (change `VECTOR_WIDTH` to 8
3639
and compile with `-mavx512f -mprefer-vector-width=512`).
3740
- Try to get a compiler optimization report (e.g. by adding `-fopt-info-vec`) both
3841
with vectorization enabled and disabled.
39-
- We encourage you to look also into assembly code both with and without vectorization
40-
(you can produce `peak_xxx.s` file e.g. with `-S -fverbose-asm`). You can look
42+
- Try to look also into assembly code both with and without vectorization
43+
(you can produce `peak_xxx.s` file with `-S -fverbose-asm`). You can look
4144
e.g. for `xmm` (SSE), `ymm` (AVX2), and `zmm` (AVX512) registers.
4245
As GCC does not inform about the use of FMA instructions in the optimization reports,
4346
looking into assembly is often the only way to find out if they are used.
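
To make the compiler options above concrete, here is a minimal sketch of a loop that can be compiled with and without them. It is a hypothetical stand-in, not the actual `peak` kernel: the function name `fma_step` and the fixed `VECTOR_WIDTH` of 4 doubles are assumptions made only for this illustration.

```cpp
#include <cstddef>

// Hypothetical stand-in for the benchmark kernel (not the actual peak code).
// Assumed: 4 doubles, i.e. one 256-bit AVX2 register.
constexpr std::size_t VECTOR_WIDTH = 4;

// Performs a[i] = a[i] * b[i] + c[i] over one vector width of elements.
// The three arrays must not overlap.
inline void fma_step(double* a, const double* b, const double* c) {
    // The pragma takes effect only when the code is compiled with
    // -fopenmp-simd (or -fopenmp); otherwise the compiler ignores it.
    #pragma omp simd
    for (std::size_t i = 0; i < VECTOR_WIDTH; ++i) {
        // Multiply followed by add: a candidate for a single fused
        // multiply-add instruction when -mfma is given.
        a[i] = a[i] * b[i] + c[i];
    }
}
```

If this is saved in a file (say `sketch.cpp`, a name chosen here) and compiled with `g++ -O2 -fopenmp-simd -mfma -S -fverbose-asm sketch.cpp`, one would expect to find instructions from the `vfmadd` family operating on `xmm` or `ymm` registers in the resulting `.s` file. The exact output depends on the compiler version and flags.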
@@ -58,7 +61,7 @@ concurrency = latency * throughput
5861
formula?
5962

6063
- In Mahti, the AMD CPU cores have 16 AVX registers, while the Intel CPU cores in Puhti have 32. With
61-
large enough `NUM_OPS`, all operands can no longer be kept in registers. Try to work in
64+
large enough `NUM_OPS`, all operands can no longer be kept in registers. Try to work out on
6265
pen and paper how many registers are needed for different operations. What is the
6366
critical value for `NUM_OPS`? Investigate what happens to performance when this value is
6467
exceeded.
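
The pen-and-paper reasoning can be sketched as a small calculation. Everything numeric here is an assumption for illustration: the latency and throughput values are placeholders and not measured on Mahti or Puhti, and the register model (one accumulator register per independent operation plus a few registers shared by all operations) is a guess about the kernel, which is not shown in this excerpt. The function names are hypothetical.

```cpp
// concurrency = latency * throughput: the number of independent operations
// that must be in flight to keep the floating point pipelines busy.
constexpr int required_concurrency(int latency_cycles, int ops_per_cycle) {
    return latency_cycles * ops_per_cycle;
}

// Assumed register model (not taken from the peak source): every
// independent operation keeps one accumulator in a register, and
// `shared_operands` further registers hold operands common to all of them.
// Returns the largest number of operations that still fits in registers.
constexpr int max_ops_in_registers(int total_registers, int shared_operands) {
    return total_registers - shared_operands;
}

// Placeholder values: a latency of 4 cycles and 2 operations issued
// per cycle would require 8 independent operations.
static_assert(required_concurrency(4, 2) == 8, "4 cycles * 2 ops/cycle");
```

Under this model, calling `max_ops_in_registers` with the register counts quoted above and the number of shared operands of the operation in question gives an estimate of the critical `NUM_OPS`. Whether the model matches the real kernel is something to check against the assembly.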
@@ -76,8 +79,8 @@ performance counters during the execution of a program. Available counters depen
7679
the underlying hardware, and can be seen with `perf list`.
7780

7881
- Repeat some of the previous tests with `perf` and try to get further understanding
79-
of the code by looking proper counters. As an example, number of instructions and
80-
cycles (number of instructions per cycle, IPC) can be seen with
82+
by looking into suitable counters. As an example, the number of instructions and
83+
cycles (and also number of instructions per cycle, IPC) can be seen with
8184
```
8285
srun ... perf stat -e instructions,cycles ./peak_fma
8386
```
