This is a replication package containing code and experimental results related to a Euro-Par 2023 paper titled: Pure C++ Approach to Optimized Parallel Traversal of Regular Data Structures
The running code examples in the paper are lifted from the experimental implementations involved in the measurements (see the Overview section). The following list maps the examples to the corresponding files:
- Section 3.1 (Introducing syntax for traversers)
  - First example: simplified/cpu-matmul-blocked-noarr-bag.cpp:10 - simplified from matmul/cpu-blk-noarr-bag.cpp
  - Second example: matmul/cpu-triv-noarr-bag.cpp:19 (a rough sketch of this one follows the list)
  - Third example: simplified/cpu-matmul-blocked-noarr-bag.cpp:13 - simplified from matmul/cpu-blk-noarr-bag.cpp
  - Fourth example: simplified/cpu-matmul-layered-noarr-bag.cpp:17 - simplified from matmul/cpu-blk-noarr-bag.cpp
- Section 3.2 (Parallel execution)
  - First example: histo/cpu-all-noarr-bag.cpp:52
  - Second example: histo/cpu-all-noarr-bag.cpp:64
- Section 4.1 (CUDA traverser)
  - First example: histo/cu-triv-noarr-bag.cu:9
  - Second example: simplified/cu-simple-noarr-bag.cu:62
- Section 4.2 (Shared memory privatization)
  - First example: simplified/cu-simple-noarr-bag.cu:11, mainly lines 35-38
  - Second example: simplified/cu-simple-noarr-bag.cu:69
  - Third example: simplified/cu-simple-noarr-bag.cu:17
  - Fourth example: simplified/cu-simple-noarr-bag.cu:42
  - All four are simplified from histo/cu-priv-noarr-bag.cu
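For orientation, the shape of the Section 3.1 traverser syntax (its second example, the trivial matrix multiplication kernel) is roughly the following. This is an illustrative sketch only, assuming that a, b, and c are Noarr bags holding matrices with dimensions (i, k), (k, j), and (i, j), respectively; the exact code is in the files referenced above.

```cpp
// Illustrative sketch; see matmul/cpu-triv-noarr-bag.cpp for the actual code.
// The traverser iterates the union of the dimensions of the three bags
// ('i', 'j', 'k'); each bag then uses only the indices of its own dimensions.
noarr::traverser(a, b, c).for_each([&](auto state) {
	c[state] += a[state] * b[state];
});
```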
For the experimental side of the work: the Overview section gives a summary of all relevant source files, test input generators, and test scripts; the Build section then goes through the build requirements and build instructions for the experimental implementations; and, finally, the Experiments section provides the specifics about experiment methodology and data visualization.
The last section, Bonus, presents more advanced uses of the Noarr library and showcases implementations that demonstrate the viability of our approach on algorithms outside linear algebra.
For a better look at the Noarr library itself, see https://github.com/ParaCoToUl/noarr-structures/ (the main repository for the library) or https://link.springer.com/chapter/10.1007/978-3-031-22677-9_27 (the paper associated with the initial proposal of the Noarr abstraction).
The following files are relevant when performing the experiments.
- Makefile: Contains build instructions for all the experimental implementations and their inputs
  - The Build section of the artifact specifies its use
- matmul: The directory containing all matrix multiplication implementations used in the measurements.
  - Files beginning with cu- or cpu-: The implementations of the matrix multiplication algorithm; each file contains just the part that defines and executes the computation. Each such implementation has up to 4 versions (distinguished by the last word before the file extension); for each implementation, two of these are written using our proposed Noarr abstraction (with Noarr bags or without), and the others serve as the pure-C++ baseline. Each group of corresponding versions is programmed to perform the exact same sequence of memory instructions (before any compiler optimizations).
    The sources cpu-blk-*.cpp and cpu-triv-*.cpp use the LOG macro (defined in a corresponding *main.hpp file) for debugging purposes. The macro uses also serve as comments explaining the algorithm (the preprocessor ignores the macro argument unless LOGGING is defined).
  - noarrmain.hpp, policymain.hpp, plainmain.hpp: Header files that define memory management, IO, and time measurement.
  - gen-matrices.py: A Python script used to prepare input matrices with random content.
- histo: The directory containing all histogram implementations used in the measurements.
  - Files beginning with cu- or cpu-: The implementations of the histogram algorithm; each file contains just the part that defines and executes the computation. Each such implementation has 3 versions (distinguished by the last word before the file extension); two of these are written using our proposed Noarr abstraction (with Noarr bags or without), and the remaining one serves as the pure-C++ baseline. Each group of corresponding versions is programmed to perform the exact same sequence of memory instructions (before any compiler optimizations).
  - histomain.cpp: Source file that defines memory management, IO, and time measurement.
  - gen-matrices.py: A script used to generate a random text with uniformly distributed characters.
- matmul-measure.sh and histo-measure.sh: Shell scripts used to measure the runtimes for matrix multiplication and histogram, respectively.
- matmul-validate.sh and histo-validate.sh: Shell scripts used to test whether all the corresponding implementations of the given algorithm produce the same result, given the same input.
- matmul-blk-test.sh and matmul-triv-test.sh: Debugging scripts that compare the work performed by the cpu-blk-* and cpu-triv-* implementations, respectively, under many different configurations. The comparison is performed via explicit logging of arithmetic and memory operations.
This section is dedicated to build requirements and build instructions.
All CPU implementations require a C++20-compliant compiler (we use C++20 only for convenience of setup; the Noarr library itself is both C++17-compliant and C++20-compliant), and all CUDA implementations require a C++17-compliant version of the nvcc compiler. The compilers are also expected to support all options specified in the Makefile, but -O3 alone should be sufficient to obtain results close enough to ours.
Apart from the noarr-structures library, the implementations themselves have no external dependencies other than TBB (which is required by just 2 implementations; use NoTBB/Makefile to skip the TBB implementations).
Other requirements:

For generating the input files, Python is required, along with the following Python packages (in versions as current as possible, as of March 2023):

- pandas
- numpy
- scipy
- scikit-learn
make all builds all matrix multiplication, histogram, and k-means examples. All compilation should proceed with no warnings if all the requirements are met.

- Replace all with matmul, histo, or kmeans to build the implementations of only one algorithm (this also applies to the following commands).
make all CLANG=@: GCC=@: builds all matrix multiplication and histogram CUDA examples.

make all NVCC=@: builds all matrix multiplication and histogram C++ examples using the clang++ and g++ compilers (you can also add CLANG=@: or GCC=@: to prevent the use of the respective compiler; or you can override one of them to use a preferred C++ compiler).
make generate generates matrix multiplication input files with sizes ranging from 2**10 to 2**13, and the random text for the histogram algorithm.

make generate-small generates matrix multiplication input files with sizes ranging from 2**6 to 2**10.
The git branches gpulab and parlab each contain the ./run_job.sh script used to perform the experiments on a particular machine in a Slurm-managed cluster, measuring GPU implementations and CPU implementations, respectively. The measurements consist of runtimes of one whole computation with any execution overhead included (but without the input preparation and output); for each input configuration, the measurements consist of 10 repetitions of a single batch consisting of a random shuffle of the different implementation versions. This whole process is performed for the histogram and matrix multiplication algorithms separately, in alternation, 10 times. In total, for each input configuration, each version of a given algorithm implementation was measured 100 times, which was sufficient to get consistent results.
- gpulab results were collected on a Linux machine equipped with a Tesla V100 PCIe 16GB GPU (driver version 525.85.12, CUDA version 12.0).
- parlab results were collected on a Linux machine equipped with an Intel(R) Xeon(R) Gold 6130 CPU, using g++ (GCC) 12.2.0.
  - Apart from the TBB histogram example, all CPU implementations were run in a single thread, as we were primarily interested in any possible abstraction overheads; the TBB-reduction overhead was measured separately as a part of the TBB examples.
Both git branches also contain results.tar.gz with experimental results stored in CSV format. These were collected into all-results.tar.gz for convenience.
The measured data are visualized in plots generated by the matmul.R and histo.R scripts, which are written using the ggplot2, dplyr, and stringr libraries.
make plots runs both of the R scripts. The scripts should recreate the content of the plots directory.
This section collects topics that did not fit within the paper (due to the length limits, or ongoing development).
The traversals performed by the Noarr traversers can be transformed by applying the .order() method with an argument that specifies the desired transformation. The available transformations include the following (the list is non-exhaustive; the Noarr library also defines many shortcuts for various combinations of transformations). An illustrative sketch of the most common cases follows the list.
- into_blocks<OldDim, NewMajorDim, NewMinorDim>(block_size) - Separates the specified dimension into blocks according to the given block size; each index from the original index space is then uniquely associated with a major index (block index) and a minor index. The dimension does not have to be contiguous. However, the transformation always assumes the dimension length is divisible by the block size; otherwise, some of the data are associated with indices outside the accessible index space and thus cannot be traversed. For the cases where the original dimension length is not guaranteed to be divisible by the block size, the Noarr library defines the following variants (see the sketch after this list):
  - into_blocks_static<OldDim, NewIsBorderDim, NewMajorDim, NewMinorDim>(block_size) - The NewIsBorderDim is the top-most dimension of the three. It has a length of 2; if it is given the index 0, the rest of into_blocks_static behaves just like into_blocks; if it is instead given the index 1 (corresponding to the border consisting of the remaining data that do not form a full block), the length of the major dimension is set to 1 (there is only one block) and the length of the minor dimension is set to the division remainder (modulo) after dividing the original dimension length by the block size.
  - into_blocks_dynamic<OldDim, NewMajorDim, NewMinorDim, NewDimIsPresent>(block_size) - The specified dimension is divided into blocks according to the given block size using round-up division. This means that each index within the original index space is uniquely associated with some major index (block index) and some minor index, and it is always accessible within the new index space (in contrast to the simple into_blocks). Without the is-present dimension, this would allow accessing items outside the original index space; for such items (we can recognize them by oldLength <= majorIdx * block_size + minorIdx being true), the is-present dimension length is 0, otherwise it is 1. This behavior ensures that traversing the legal items works as expected, while the traversal over the items outside the original index space visits 0 items.

    This essentially emulates a pattern commonly found in algorithms for massively parallel platforms (such as CUDA), where we map each thread to an element of an array (we assume an array for simplicity, but this generalizes) and each thread starts its work by checking whether its corresponding index falls within the bounds of the array. For this case, we might use into_blocks_dynamic<ArrayDim, GridDim, BlockDim, DimIsPresent> and then auto ct = cuda_threads<GridDim, BlockDim>(transformed_traverser) to bind the appropriate dimensions; the traversal performed by the ct.inner() inner traverser then transparently checks the array bounds as if we had used the if (x >= array_length) return; pattern manually.
  - There is also a symmetrical transformation called merge_blocks.
- sized_vector<Dim>(size) - Adds a new dimension to the traversal. This has limited usefulness for traversal transformations, but it is the most basic transformation for memory layouts.
- fix<Dim>(value), fix<Dims...>(values...), fix(state) - Limits the traversal to a certain row, column, block, etc.
- bcast<Dim>(length) - Creates a dimension that just consumes an index from a certain range of values. The result is not affected by any particular index from the given range.
  - Typically used as a placeholder: e.g. when binding dimensions via cuda_threads, or, when adding blocking to some implementation configurations, we can use bcast in the other configurations to preserve the same API.
- rename<OldDim, NewDim> - Renames a dimension of a structure. This is especially useful when we want to specify that two (or more) structures are to be traversed in parallel along some axes (in such cases, we would typically apply the rename directly to the given structure(s)).
- shift<Dim>(offset) - Maps each index provided for the specified dimension to an index shifted by the given offset; it also shortens the traversed length of the dimension accordingly, so all shifted indices stay within the original range.
- slice<Dim>(offset, length) - A generalization of shift that also explicitly restricts the length of the specified dimension. This is useful whenever we require a traversal through a contiguous subset of indices; a typical use case is a stencil.
It is noteworthy that the same transformations can be applied to memory layouts and traversals alike. The transformations are programmed so that they accept and propagate type-literal arguments to enable further optimizations. It is also possible to provide additional, user-defined transformations as needed.
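As a sketch of the last point (again illustrative only, reusing the hypothetical bag and n from the previous sketch), the very same transformation value can be composed into a memory layout or passed to the .order() method of a traverser:

```cpp
// The same proto-structure value used in both roles (illustrative sketch).
auto blocks = noarr::into_blocks<'i', 'I', 'i'>(16);

// Applied to a memory layout: the elements become addressable by ('I', 'i'),
// while the underlying data stay one contiguous run of floats.
auto blocked_layout = noarr::scalar<float>() ^ noarr::sized_vector<'i'>(n) ^ blocks;

// Applied to a traversal: the layout of the bag is untouched; only the order
// in which the traverser visits the elements changes.
noarr::traverser(bag).order(blocks).for_each([&](auto state) {
	bag[state] += 1.f;
});
```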
Noarr is viable not only in simple computation kernels working on regular structures. The implementation of the k-means algorithm showcases some of the more advanced uses of the library, including a simple data (de)serialization example.
- kmeans: Contains three versions of an implementation of the k-means algorithm (two Noarr versions and one in pure C++). One Noarr version uses Noarr bags and the other one does not. The names follow the naming scheme of the two previously mentioned algorithms.
- kmeans-gen.py: A Python script that generates data that are very likely to be clustered deterministically into the expected number of clusters. Using a simple heuristic, it also attempts to generate data such that the process is maximally resistant to numeric errors caused by floating-point arithmetic.
  - make generate-kmeans uses this script to generate a few test inputs (it uses a pre-specified seed to promote replicability).
- The k-means experiments can be performed by the following process:

  make kmeans
  make generate-kmeans
  ./kmeans-validate.sh && ./kmeans-measure.sh
This experiment also includes a comparison against the sklearn k-means algorithm configured to perform the same job. It is included mainly to provide the correct output for validation.
The stencil algorithm does not have an experiment associated with it, as the abstractions it uses are not yet fully optimized; it is included to showcase the flexibility of the presented abstractions.
- stencil: This directory, as usual, contains three versions of a simple implementation of a blur stencil (performing a flat blur with a specified neighborhood radius): two versions that use the Noarr abstraction (one with Noarr bags and the other without) and one written in pure C++.
  - draw_image.py: Generates an image represented as a sequence of height * width * channels numbers ranging from 0 to 256. The image consists of a noisy rectangle in the middle of a black field.
  - draw_image.py: Transforms a sequence of height * width * channels numbers into an image (if it can be interpreted as one). This is included just to help visualize that the algorithm performs the expected work.
- Build:
# alternatively, link it, if it is already cloned elsewhere
git clone https://github.com/ParaCoToUl/noarr-structures/
mkdir -p build/stencil/
<FAVORITE_COMPILER> -std=c++20 -Ofast -Wall -DNDEBUG \
-o build/stencil/noarr -Inoarr-structures stencil/cpu-stencil-noarr.cpp
<FAVORITE_COMPILER> -std=c++20 -Ofast -Wall -DNDEBUG \
-o build/stencil/noarr-bag -Inoarr-structures stencil/cpu-stencil-noarr-bag.cpp
<FAVORITE_COMPILER> -std=c++20 -Ofast -Wall -DNDEBUG \
-o build/stencil/plain -Inoarr-structures stencil/cpu-stencil-plain.cpp
- Run:
build/stencil/<BINARY_NAME> <input> <height> <width> <channels> <radius>