This repository contains a template-based microbenchmark generation framework and a family of microbenchmarks for the A64 instruction set.
The microbenchmark code generation uses the Inja template engine [1] and the JSON for Modern C++ header-only library [2].
Benchmarks can be instrumented to run on the gem5 simulator [3] or to use the LIKWID performance monitoring suite [4].
```shell
# Clone this repository
git clone https://github.com/FZJ-JSC/ietubench.git

# Configure build using CMake
cmake -B build/ietubench ietubench -D CMAKE_BUILD_TYPE="Release"

# Optional: Configure build using CMake and link with LIKWID
# cmake -B build/ietubench-lk ietubench -D CMAKE_BUILD_TYPE="Release" -D PROJECT_USE_LIKWID="ON"

# Build all microbenchmarks defined in the src/micro subdirectory
make -C build/ietubench -j$(nproc)
```
| Benchmark | Directory |
|---|---|
| Instruction Execution Throughput | micro/iet |
| Core-to-core Latency | micro/c2c |
| Branch Prediction | micro/bp |
Instruction Execution Throughput microbenchmarks execute a single instruction or a small group of instructions over a loop of length L. From the measured cycle counts, the latency and execution throughput of the instruction sequence can be derived.
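As a back-of-the-envelope illustration (not part of the benchmark suite; the formula and all numbers below are assumptions about how such figures are typically derived), throughput in instructions per cycle can be estimated from a measured cycle count like this:

```shell
#!/bin/sh
# Hypothetical post-processing sketch: estimate instructions per cycle (IPC)
# from a measured cycle count. All values below are made-up examples.
L=64          # loop length (instructions per loop iteration)
ITER=100      # number of main loop iterations
CYCLES=1600   # measured cycle count (example value)
# IPC = total instructions executed / elapsed cycles
awk -v l="$L" -v i="$ITER" -v c="$CYCLES" \
    'BEGIN { printf "IPC = %.2f\n", (l * i) / c }'
```

With the example values above this prints `IPC = 4.00`; in practice the cycle count comes from the benchmark output.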
Other microbenchmark implementations that measure instruction execution rates include [5] and [6].
```shell
# Benchmark latency/throughput of integer ADD using a loop of length L=64,
# iterating the main loop 100 times and repeating the benchmark 150 times
./src/micro/iet/add_a_64.x -l 100 -r 150

# Benchmark latency/throughput of integer ADD and 64 bit FP fadd for loops
# of length L=16,64,256,1024,4096 and print the output in CSV format
for L in 16 64 256 1024 4096 ; do
    ./src/micro/iet/add_fadd64_b_$L.x -l 100 -r 100 -o | awk -vT="$L" '{ print T "," $0 }'
done | tee iet.csv
```
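The resulting iet.csv can then be summarized per loop length. The exact column layout of the benchmark's `-o` output is not shown here, so the sketch below assumes, purely for illustration, that the first field is the loop length (added by the awk prefix above) and the second field a cycle count; the sample data is made up:

```shell
#!/bin/sh
# Hypothetical sketch: average the second CSV field per loop length.
# Column layout and sample values are assumptions for illustration only.
cat > iet_sample.csv <<'EOF'
16,40
16,44
64,130
64,134
EOF
awk -F, '{ sum[$1] += $2; n[$1]++ }
         END { for (l in sum) printf "%s,%.1f\n", l, sum[l] / n[l] }' iet_sample.csv | sort -n
```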
The core-to-core latency benchmarks measure the time required to pass a message between two CPU cores. We define the core-to-core latency as the time for one such message-passing exchange between a pair of cores, measured over many iterations of the loop shown below.
```asm
WAIT_ACQ:
    ldar    x1, [#local]
    cmp     x1, x2
    bls     WAIT_ACQ
    add     x2, x1, #1
    stlr    x2, [#remote]
```
The WAIT_ACQ loop is described in section K14.2.1 of [7] as a way to implement message passing on weakly-ordered memory architectures; it is implemented in c2c/ldar.
```shell
# Benchmark core-to-core message passing latency for CPUs m and n,
# iterating the main loop l times and repeating the benchmark r times
./src/micro/c2c/ldar_256.x -t m,n -l l -r r

# Run the c2c benchmark for all pairs [(0,1), .. ,(0,63)]
for n in $(seq 1 63); do
    ./src/micro/c2c/ldar_256.x -t 0,$n -l 1000 -r 10 -o | awk -vT="$n" '{ print T "," $0 }'
done | tee c2c.csv
```
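A common way to condense such a scan is to report the best observed latency per peer core. The sketch below assumes, for illustration only, that the first field of c2c.csv is the peer core id (added by the awk prefix above) and the second a latency value; the sample data is made up:

```shell
#!/bin/sh
# Hypothetical sketch: report the minimum of the second CSV field per peer core.
# Column layout and sample values are assumptions for illustration only.
cat > c2c_sample.csv <<'EOF'
1,55
1,52
2,80
2,78
EOF
awk -F, '!($1 in min) || $2 < min[$1] { min[$1] = $2 }
         END { for (c in min) printf "%s,%s\n", c, min[c] }' c2c_sample.csv | sort -n
```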
Branch Prediction microbenchmarks operate on a sequence of random integers stored in an array.
Consider the loop shown below and let x10 point to an integer array constructed as described above.
Each array element read is compared to a given integer input parameter p, which determines whether the conditional branch is taken and hence controls the branch-taken probability.
```asm
LOOP:
    ldrb    w9, [x10], #1
    cmp     x9, x11
    bcc     .TARGET_1
.BACK_FROM_TARGET_1:
    ldrb    w9, [x10], #1
    cmp     x9, x11
    bcc     .TARGET_2
.BACK_FROM_TARGET_2:
    ...
```
The loop is implemented in bp/bcc.
A similar experiment using arrays of random integers to benchmark branch prediction is presented in [8].
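The data files used below ship with the repository and their exact construction is not shown here. Purely as an illustration of the idea, an array of uniformly distributed byte values in [1,99], against which a threshold compare with p in 1..100 succeeds with probability of roughly p/100, could be generated along these lines:

```shell
#!/bin/sh
# Illustrative sketch only: generate N pseudo-random byte values in [1,99].
# The real data files in the repository may be constructed differently.
N=4096
awk -v n="$N" 'BEGIN { srand(42);               # fixed seed for reproducibility
                       for (i = 0; i < n; i++)
                           printf "%c", int(99 * rand()) + 1 }' > data_sample.dat
wc -c < data_sample.dat   # the file holds exactly N bytes
```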
```shell
# Use LIKWID to measure selected Armv8 performance counters on processor 8
# for all values of p=1..100
export CTR="CPU_CYCLES:PMC0,INST_RETIRED:PMC1,BR_PRED:PMC2,BR_RETIRED:PMC3,BR_MIS_PRED:PMC4,BR_MIS_PRED_RETIRED:PMC5"
for p in $(seq 1 100); do
    likwid-perfctr -C 8 -c 8 -g $CTR -o csv/bcc_b_4096_p${p}.csv -m \
        ./src/micro/bp/bcc_b_4096-lk.x -r 100 -l 100 -f src/data/data_s409600_r100.dat -p $p
done
```
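Given raw counter values extracted from those CSV files, a branch misprediction rate is simply BR_MIS_PRED_RETIRED divided by BR_RETIRED. A sketch with made-up counts (not measured data):

```shell
#!/bin/sh
# Sketch: compute a branch misprediction rate from two raw counter values.
# The counts below are made-up example values, not measurements.
BR_RETIRED=200000
BR_MIS_PRED_RETIRED=25000
awk -v m="$BR_MIS_PRED_RETIRED" -v b="$BR_RETIRED" \
    'BEGIN { printf "misprediction rate = %.1f%%\n", 100 * m / b }'
```

With these example counts the script prints `misprediction rate = 12.5%`.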
[1] Inja template engine, https://github.com/pantor/inja
[2] JSON for Modern C++, https://github.com/nlohmann/json
[3] gem5 computer-system architecture simulator, https://www.gem5.org
[4] LIKWID performance monitoring and benchmarking suite, https://github.com/RRZE-HPC/likwid
[5] ChipsandCheese/Microbenchmarks
[6] Cryptographic libraries comparative benchmarks
[7] Arm Architecture Reference Manual for A-profile architecture