The Architecture Independent Workload Characterization (AIWC, pronounced 'air-wick') tool is a plugin for the Oclgrind OpenCL simulator that gathers metrics from OpenCL programs; these metrics can be used to understand and predict program performance on any given hardware architecture.
A Dockerfile is provided for rapid evaluation. A prebuilt image can be run directly with:
docker run beaujoh/aiwc:v1.1
Alternatively, to build your own image based on the provided Dockerfile, run:
docker build -t <yourname>/aiwc:v1.1 .
docker run <yourname>/aiwc:v1.1
Both images launch a demonstration of using AIWC to profile LU decomposition (an OpenCL implementation from the Extended OpenDwarfs benchmark suite); AIWC itself should support all OpenCL codes. To test it on your own codes, run Docker with an interactive session:
docker run -it beaujoh/aiwc:v1.1 bash
Set the following environment variables as desired:
export OCLGRIND=/oclgrind
export OCLGRIND_SRC=/oclgrind-source
export OCLGRIND_BIN=/oclgrind/bin/oclgrind
The rest can be built with the following commands (tested on Ubuntu 20.04):
apt-get update && apt-get install --no-install-recommends -y libreadline-dev libclang-12-dev
git clone https://github.com/BeauJoh/AIWC.git $OCLGRIND_SRC
mkdir $OCLGRIND_SRC/build
cd $OCLGRIND_SRC/build
export CC=/llvm-12/bin/clang
export CXX=/llvm-12/bin/clang++
cmake $OCLGRIND_SRC -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_DIR=/llvm-12/lib/cmake/llvm -DCLANG_ROOT=/llvm-12 -DCMAKE_INSTALL_PREFIX=$OCLGRIND -DBUILD_SHARED_LIBS=On
make
make install
mkdir -p /etc/OpenCL/vendors && echo $OCLGRIND/lib/liboclgrind-rt-icd.so > /etc/OpenCL/vendors/oclgrind.icd
To use AIWC from the command line, pass the --aiwc argument immediately after the oclgrind program name.
An example of its usage on the OpenCL kmeans application is shown below:
oclgrind --aiwc ./kmeans <args>
The collected metrics are written to a csv file, one per kernel per invocation. These files can be found in the working directory, with the naming convention aiwc_α_β.csv, where α is the kernel name and β is the invocation count.
Alternatively, Oclgrind can be used as a regular OpenCL device, with AIWC enabled via the following environment variables: OCLGRIND_WORKLOAD_CHARACTERISATION, an int/boolean that selects AIWC as the plugin used within Oclgrind, and OCLGRIND_WORKLOAD_CHARACTERISATION_OUTPUT_PATH, a string denoting the path where the AIWC metrics should be logged (as a csv).
For example:
OCLGRIND_WORKLOAD_CHARACTERISATION=1 OCLGRIND_WORKLOAD_CHARACTERISATION_OUTPUT_PATH=~/aiwc_metrics.csv ./kmeans <args>
To generate a markdown report of the metrics, run script/aiwc_report.py ./path/to/logfile. This prints a variety of metrics, some derived from others. To compare metrics between logs, run script/aiwc_report.py --compare ./logfile1 ./logfile2 .... This prints the values that differ significantly (the threshold can be configured by modifying the script) and also generates a plot for each metric in a newly created plots directory. To set the name of each log file, pass --names "name1;name2;..." as an additional argument; names default to the kernel name if not passed.
Metrics reported by AIWC are intended to capture the memory access patterns, control flow operations and available parallelism that are key to achieving efficiency across architectures. The metrics collected by the AIWC tool are listed below, ordered by type.
Type | Metric | Description |
---|---|---|
Compute | Opcode | total # of unique opcodes required to cover 90% of dynamic instructions |
Compute | Total Instruction Count | total # of instructions executed |
Parallelism | Work-items | total # of work-items or threads executed |
Parallelism | Total Barriers Hit | total # of barrier instructions |
Parallelism | Min ITB | minimum # of instructions executed until a barrier |
Parallelism | Max ITB | maximum # of instructions executed until a barrier |
Parallelism | Median ITB | median # of instructions executed until a barrier |
Parallelism | Min IPT | minimum # of instructions executed per thread |
Parallelism | Max IPT | maximum # of instructions executed per thread |
Parallelism | Median IPT | median # of instructions executed per thread |
Parallelism | Max SIMD Width | maximum # of data items operated on during an instruction |
Parallelism | Mean SIMD Width | mean # of data items operated on during an instruction |
Parallelism | SD SIMD Width | standard deviation across # of data items affected |
Memory | Total Memory Footprint | total # of unique memory addresses accessed |
Memory | 90% Memory Footprint | # of unique memory addresses that cover 90% of memory accesses |
Memory | Unique Reads | total # of unique memory addresses read |
Memory | Unique Writes | total # of unique memory addresses written |
Memory | Unique Read/Write Ratio | indication of whether the workload is read or write intensive (unique reads / unique writes) |
Memory | Total Reads | total # of memory addresses read |
Memory | Total Writes | total # of memory addresses written |
Memory | Reread Ratio | indication of memory reuse for reads (unique reads / total reads) |
Memory | Rewrite Ratio | indication of memory reuse for writes (unique writes / total writes) |
Memory | Global Memory Address Entropy | measure of the randomness of memory addresses |
Memory | Local Memory Address Entropy | measure of the spatial locality of memory addresses |
Memory | Relative Local Memory Usage | proportion of all memory accesses to memory allocated as __local |
Memory | Parallel Spatial Locality | average of entropies of threads in a work-group that share local memory |
Control | Total Unique Branch Instructions | total # of unique branch instructions |
Control | 90% Branch Instructions | # of unique branch instructions that cover 90% of branch instructions |
Control | Yokota Branch Entropy | branch history entropy using Shannon's information entropy |
Control | Average Linear Branch Entropy | branch history entropy score using the average linear branch entropy |
For each OpenCL kernel invocation, the AIWC tool collects the 30 metrics listed in the table above. The opcode, total memory footprint and 90% memory footprint measures are simple counts. Likewise, total instruction count is the number of instructions executed during a kernel's execution. The global memory address entropy (MAE) is a positive real number that corresponds to the randomness of memory addresses accessed. The local memory address entropy is computed as 10 separate values according to an increasing number of Least Significant Bits (LSB), or low-order bits, omitted from the calculation.
The number of bits skipped ranges from 1 to 10, and a steeper drop in entropy with an increasing number of skipped bits indicates greater spatial locality in the address stream. Both unique branch instructions and the associated 90% branch instructions are counts of the logical control-flow branches encountered during kernel execution. Yokota branch entropy ranges between 0 and 1, and offers an indication of a program's predictability as a floating-point entropy value. The average linear branch entropy metric is proportional to the miss rate in program execution: p = 0 for branches always taken or always not taken, and p = 0.5 for the most unpredictable control flow. All branch-prediction metrics are computed using a fixed history of 16-element branch strings, each composed of 1-bit branch results (taken/not-taken).
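For illustration, the following Python sketch shows how these entropy scores can be computed from simple traces. It is a simplified rendering for exposition, not AIWC's implementation, and all function and variable names are our own.

```python
from collections import Counter
from math import log2

def memory_address_entropy(addresses, skipped_bits=0):
    """Shannon entropy of a memory address trace, with a configurable
    number of low-order bits dropped; a steeper fall in entropy as
    skipped_bits grows indicates greater spatial locality."""
    counts = Counter(addr >> skipped_bits for addr in addresses)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def average_linear_branch_entropy(outcomes, history_len=16):
    """Average linear branch entropy over fixed-length histories of
    1-bit taken/not-taken results: the per-history score 2*min(p, 1-p)
    is 0 for perfectly predictable branches and 1 when p = 0.5."""
    seen = Counter()         # occurrences of each 16-element history
    taken_after = Counter()  # times the branch following a history is taken
    for i in range(history_len, len(outcomes)):
        history = tuple(outcomes[i - history_len:i])
        seen[history] += 1
        taken_after[history] += outcomes[i]
    total = sum(seen.values())
    return sum((n / total) * 2 * min(taken_after[h] / n, 1 - taken_after[h] / n)
               for h, n in seen.items())
```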
As the OpenCL programming model targets parallel architectures, any workload characterization must consider exploitable parallelism and the associated communication and synchronization costs. We characterize thread-level parallelism (TLP) by the number of work-items executed by each kernel, which indicates the maximum number of threads that can be executed concurrently.
Work-item communication hinders TLP and, in the OpenCL setting, takes the form of either local communication within a work-group using local synchronization (barriers), or global communication by dividing the kernel and invoking the smaller kernels on the command queue. Both local and global synchronization can be measured in instructions to barrier (ITB) by keeping a running tally of instructions executed per work-item until a barrier is encountered, at which point the count is saved and reset; this count naturally includes the final (implicit) barrier at the end of the kernel. Min, max and median ITB are reported to expose synchronization overheads, as a large difference between min and max ITB may indicate an irregular workload.
Instructions per thread (IPT) metrics are generated by keeping a running tally of instructions executed per work-item until completion, at which point the count is saved and reset. Min, max and median IPT are reported to expose load imbalance.
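A minimal sketch of both tallies follows, assuming a simplified per-work-item event trace; the trace format and all names are hypothetical, not AIWC's internals.

```python
from statistics import median

def itb_and_ipt(traces):
    """traces: list of per-work-item event lists, where each event is
    'inst' (an executed instruction) or 'barrier'.  Returns the
    instructions-to-barrier (ITB) samples and the instructions-per-thread
    (IPT) totals described above."""
    itb_samples, ipt_totals = [], []
    for events in traces:
        count = 0   # running tally since the last barrier
        total = 0   # all instructions executed by this work-item
        for event in events:
            if event == 'inst':
                count += 1
                total += 1
            else:  # barrier: save the tally and reset it
                itb_samples.append(count)
                count = 0
        itb_samples.append(count)  # implicit barrier at kernel exit
        ipt_totals.append(total)
    return itb_samples, ipt_totals

traces = [['inst'] * 5 + ['barrier'] + ['inst'] * 3,
          ['inst'] * 4 + ['barrier'] + ['inst'] * 6]
itb, ipt = itb_and_ipt(traces)
print(min(itb), max(itb), median(itb))  # min/max/median ITB
print(min(ipt), max(ipt), median(ipt))  # min/max/median IPT
```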
To characterize data parallelism, we examine the number and width of vector operands in the generated LLVM IR, reported as max SIMD width, mean SIMD width and the standard deviation, SD SIMD width. Parallelism is further characterized by the work-items and total barriers hit metrics.
Some of the other metrics are highly dependent on workload scale, so work-items may be used to normalize between different scales. For example, total memory footprint can be divided by work-items to give the total memory footprint per work-item, which indicates the memory required per processing element.
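As a minimal illustration of this normalization (the values below are hypothetical, not taken from a real AIWC log):

```python
# Hypothetical metric values for illustration only.
total_memory_footprint = 1_048_576  # unique memory addresses accessed
work_items = 4_096                  # work-items executed by the kernel

# Scale-normalized variant: memory required per processing element.
footprint_per_work_item = total_memory_footprint / work_items
print(footprint_per_work_item)      # 256.0 unique addresses per work-item
```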
On Relative Local Memory Usage (RLMU): This measures the proportion of all memory accesses from the symbolic execution of the kernel that occurred to memory allocated as __local.
On GPUs, this memory address space is mapped to fast on-chip shared memory.
Relative local memory usage is an example of a metric that is useful to measure performance-critical access patterns on some architectures such as GPUs, and not others, such as CPUs.
CPUs typically have no notion of user-controlled on-chip memory shared between hardware threads comparable to GPUs' shared memory.
This is a natural consequence of programming for a heterogeneous system.
Specific code patterns may translate to performance improvements only on certain hardware.
Like the memory address entropy metrics, RLMU is computed over virtual addresses: MAE values are calculated over an abstract, ideal address space on which all memory accesses by the kernel occur.
This allows AIWC to abstract accurately over hardware- and ISA-specific differences in memory layout across its diverse hardware targets.
On Parallel Spatial Locality (PSL): Aggregate metrics of the kind presented by AIWC necessarily present a simplified view of program behaviour, omitting many details.
Different ways of aggregating program measurements lead to different features of program execution being emphasised in the final metrics.
For example, the calculation of memory address entropy relies only on the frequency distribution of memory accesses to all addresses accessed by the kernel, and discards temporal information.
The order of sequential memory accesses performed by each thread, as well as the relationships between work-items in an OpenCL work-group, are both vital to accurately characterizing parallel codes.
To this end, PSL is a parallel computing analogue of MICA's data stride metric, which measures the distance between consecutive data accesses in a single-threaded environment.
In parallel programs, to accurately measure spatial locality of accesses, we must consider memory accesses performed by multiple threads in a close temporal scope.
PSL thus calculates the locality of accesses in each timestep of the program's execution. Steeper reductions in the parallel spatial locality score as the number of dropped bits n increases will be observed in programs that often access nearby memory addresses within the same timestep.
Such programs will perform better on GPUs, as they will make better use of both global memory access coalescing and shared memory bank structures.
To a lesser extent, the proposed metric reflects performance-critical memory access patterns on CPUs, as pulling a single cache line from global memory into last-level cache may improve memory access times for all CPU cores.
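A minimal sketch of such a score follows, under our own assumptions about the trace format rather than AIWC's actual source: it groups accesses by timestep, computes an address entropy per timestep at each number of dropped low-order bits, and averages over timesteps.

```python
from collections import Counter, defaultdict
from math import log2

def parallel_spatial_locality(accesses, max_skipped_bits=10):
    """accesses: (timestep, address) pairs merged across all work-items.
    Returns one score per number of dropped low-order bits: the mean,
    over timesteps, of the address entropy within that timestep.  A
    steep fall as bits are dropped indicates that accesses within a
    timestep cluster around nearby addresses."""
    by_step = defaultdict(list)
    for step, addr in accesses:
        by_step[step].append(addr)

    def entropy(addrs, skipped_bits):
        counts = Counter(a >> skipped_bits for a in addrs)
        total = len(addrs)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    return [sum(entropy(a, n) for a in by_step.values()) / len(by_step)
            for n in range(max_skipped_bits + 1)]
```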
Finally, unique versus total reads and writes can indicate shared and local memory reuse between work-items within a work-group, and globally, which reflects the predictability of a workload. To present these characteristics, the unique reads, unique writes, unique read/write ratio, total reads, total writes, reread ratio and rewrite ratio metrics are proposed. The unique read/write ratio shows whether the workload is balanced, read intensive or write intensive. These metrics are computed by storing read and write memory accesses separately; the accesses are later combined to compute the global and local memory address entropy scores.
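These ratios follow directly from the definitions in the table above; a minimal sketch, assuming plain lists of read and written addresses (names are our own):

```python
def read_write_metrics(reads, writes):
    """reads/writes: lists of memory addresses read/written by a kernel.
    Returns the ratio metrics described above."""
    unique_reads, unique_writes = len(set(reads)), len(set(writes))
    return {
        'unique read/write ratio': unique_reads / unique_writes,
        'reread ratio': unique_reads / len(reads),     # lower = more reuse
        'rewrite ratio': unique_writes / len(writes),  # lower = more reuse
    }
```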
The following are examples of projects that have made heavy use of AIWC for analysis, whether for performance prediction or workload characterization:
- https://github.com/BeauJoh/aiwc-opencl-based-architecture-independent-workload-characterization-artefact
- https://github.com/BeauJoh/opencl-predictions-with-aiwc
If you use AIWC, please cite Oclgrind and the most appropriate of the following papers:
- Characterizing and Predicting Scientific Workloads for Heterogeneous Computing Systems
- AIWC: OpenCL-Based Architecture-Independent Workload Characterization
- OpenCL Performance Prediction using Architecture-Independent Features
- Characterizing Optimizations to Memory Access Patterns using Architecture-Independent Program Features
For issues and questions with AIWC, please contact Beau Johnston (beau@inbeta.org) or raise an issue on GitHub:
https://github.com/beaujoh/AIWC/issues
For additional information on Oclgrind, on which this plugin is built, please see https://github.com/jrprice/Oclgrind