This repository contains parallel patterns implementations of some of the applications in the PARSEC benchmark suite.
All the applications (except x264) have been implemented with the FastFlow pattern-based parallel programming framework. Some benchmarks have also been implemented with the SkePU2 framework, and others with the C++ Actor Framework (CAF). The following table lists the pattern used for each benchmark and the file(s) containing the actual implementation for FastFlow, SkePU2 and CAF. The pattern descriptions reported here are approximate; exact descriptions will come later. Some benchmarks are implemented with several different patterns (the pattern in bold is the one used by default). To run a benchmark with a different pattern, refer to the specific section of this document.
Application | Used Pattern | FastFlow Files | SkePU2 Files | CAF Files |
---|---|---|---|---|
Blackscholes | Map | File 1 | File 1 | File 1 |
Bodytrack | Maps | File 1, File 2 | ||
Canneal | Master-Worker | File 1 | File 1 | |
Dedup | **Pipeline of Farms** | File 1 | ||
" | Farm | File 1 | ||
" | Farm of Pipelines | File 1 | ||
" | Ordering Farm | File 1 | ||
Facesim | Maps | File 1, File 2, File 3, File 4 | ||
Ferret | **Pipeline of Farms** | File 1 | File 1 | |
" | Farm of Pipelines | File 1 | File 1 | |
" | Farm | File 1 | ||
" | Farm (Optimized) | File 1 | File 1 | |
Fluidanimate | Maps | File 1 | ||
Freqmine | Maps | File 1 | ||
Raytrace | Map | File 1 | File 1 | File 1 |
Streamcluster | Maps and MapReduce | File 1 | File 1 | |
Swaptions | Map | File 1 | File 1 | |
Vips | Farm | File 1 | ||
x264 | Not available. |
These implementations have been engineered to work with the standard PARSEC tools. Accordingly, you can use and evaluate the parallel patterns implementations together with the Pthreads, OpenMP and TBB versions already present in PARSEC. Beyond this guide, more details can be found on the PARSEC website.
To download the latest version of P3ARSEC, run the following commands:
wget https://github.com/ParaGroup/p3arsec/archive/v1.0.tar.gz
tar -xvf v1.0.tar.gz
cd p3arsec-1.0
Then, run:
./install.sh
These commands may take a few minutes to complete, since they download the original PARSEC implementations with all the input datasets (around 3GB) and all the needed dependencies.
You can pass the following parameters to the ./install.sh command:
- --nomeasure: The infrastructure for measuring execution time and energy consumption will not be installed. If this parameter is not specified, you will be able to measure execution time and energy consumption for all the benchmarks (both those implemented as parallel patterns and those already present in PARSEC).
- --fast: Only the PARSEC source code (112MB) and some small test inputs are downloaded. You can download the input datasets later by running ./install.sh --inputs.
- --inputs: Only the PARSEC input files are downloaded. Use it only if ./install.sh --fast has already been run.
- --skeputools: Compiles and installs the SkePU2 source-to-source compiler. This is not mandatory; you only need it if you want to modify the *_skepu.cpp files. This parameter is mainly intended for developers.
For PARSEC to work properly, some dependencies need to be installed. On Ubuntu systems, you can install them with the following command:
sudo apt-get install git build-essential m4 x11proto-xext-dev libglu1-mesa-dev libxi-dev libxmu-dev libtbb-dev libssl-dev
On Arch Linux, use the following command:
sudo pacman -Sy git m4 xorgproto glu libxi libxmu intel-tbb openssl
Similar packages can be found for other Linux distributions.
After that, you need to install the benchmarks you are interested in:
cd bin
The parallel patterns versions of the benchmarks have been integrated with the original PARSEC management system (./parsecmgmt).
You can find the full documentation here or in the README_PARSEC file, which will appear in the directory after the previous commands have been run.
To compile the parallel patterns version of a specific benchmark, it is sufficient to run the following command:
./parsecmgmt -a build -p [BenchmarkName] -c gcc-ff
If you also want to compile the other existing versions of the benchmark, just replace gcc-ff with one of the following:
- gcc-skepu for the SkePU2 parallel pattern-based implementation.
- gcc-pthreads for the Pthreads implementation.
- gcc-openmp for the OpenMP implementation.
- gcc-tbb for the Intel TBB implementation.
- gcc-caf for the CAF implementation.
Note that not all these implementations are available for all the benchmarks. For more details on supported implementations, please refer to the original PARSEC documentation (and to the table at the top of this file for the FastFlow, SkePU2 and CAF versions).
ATTENTION: If you plan to execute the benchmarks with more than 1024 threads, you need to modify the following macros:
- MAX_THREADS in the pkgs/apps/blackscholes/src/c.m4.pthreads file.
- MAX_NUM_THREADS in the pkgs/libs/fastflow/ff/config.hpp file.
Once you have compiled a benchmark, you can run it with:
./parsecmgmt -a run -p [BenchmarkName] -c gcc-ff -n [ConcurrencyLevel]
Here too, you can run the other existing versions by replacing gcc-ff with the name of the desired version.
By default, the program is run on the test input. PARSEC provides different input datasets: test, simsmall, simmedium, simlarge, simdev and native.
The native input set is the one resembling a real execution scenario, while the others should be used for testing/simulation purposes. To select an input set, pass it with the -i parameter. For example, to run the parallel patterns implementation of the Canneal benchmark on the native input set:
./parsecmgmt -a run -p canneal -c gcc-ff -i native
All the datasets are present if you ran ./install.sh (or ./install.sh --fast followed by ./install.sh --inputs).
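To exercise a benchmark on every input set, the run command can be generated in a loop. The following is a minimal sketch, not part of the repository: the helper name print_runs is hypothetical, and it only prints the commands (using the -i and -n parameters described in this document) so that they can be reviewed before being executed from the bin directory.

```shell
# Print (do not execute) one parsecmgmt run command per PARSEC input set.
# Arguments: benchmark name, configuration, concurrency level.
# print_runs is a hypothetical helper, not shipped with P3ARSEC.
print_runs() {
    pr_bench="$1"; pr_config="$2"; pr_n="$3"
    for pr_input in test simdev simsmall simmedium simlarge native; do
        echo "./parsecmgmt -a run -p $pr_bench -c $pr_config -i $pr_input -n $pr_n"
    done
}

# Example: the six commands for the FastFlow version of canneal.
print_runs canneal gcc-ff 4
```

Piping the output to sh would execute the commands one after the other; printing them first makes it easy to drop the input sets that were not downloaded.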
ConcurrencyLevel has the same meaning as in the original PARSEC benchmarks: it is the minimum number of threads that will be activated by the application. Accordingly, for a concurrency level n, we have the following values:
- blackscholes: n+1 threads.
- canneal: n+1 threads.
- dedup: n threads for each pipeline stage (3n + 3 threads). (For the pipe of farms version.)
- ferret: n threads for each pipeline stage (4n + 4 threads). (For the pipe of farms version.)
- swaptions: n+1 threads.
Some parallel patterns implementations may not follow this rule. For example, the ordered farm implementation of the dedup benchmark will activate n+2 threads.
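The rules above can be turned into a small calculator, which is handy when checking whether a run stays below a core count or below the 1024-thread limit. This is only a sketch: min_threads is a hypothetical helper, it encodes just the five benchmarks listed above, and for dedup and ferret it assumes the pipe of farms version.

```shell
# Minimum number of threads activated for a given benchmark and -n value,
# following the list above. min_threads is a hypothetical helper.
# dedup and ferret: pipe of farms version only.
min_threads() {
    case "$1" in
        blackscholes|canneal|swaptions) echo $(( $2 + 1 )) ;;
        dedup)  echo $(( 3 * $2 + 3 )) ;;
        ferret) echo $(( 4 * $2 + 4 )) ;;
        *) echo "no rule listed for $1" >&2; return 1 ;;
    esac
}

min_threads blackscholes 8   # prints 9
min_threads dedup 8          # prints 27
min_threads ferret 8         # prints 36
```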
If you want to measure the energy consumption of the benchmarks (and you did not pass the --nomeasure parameter to the ./install.sh script), please run the benchmarks with sudo. In this case, the output of the program will contain something like:
sudo ./parsecmgmt -a run -p canneal -c gcc-ff -i native
...
roi.time|12.3
roi.joules|TYPE|456.7
...
Where 12.3 is the execution time and 456.7 is the energy consumption. Both values consider only the time and energy spent in the Region Of Interest (ROI), i.e. the parallel part of the application, excluding initialisation and cleanup phases (e.g. loading a dataset from the disk to the main memory). This approach is commonly used in scientific literature to evaluate PARSEC behaviour.
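When collecting results from many runs, the two ROI lines can be extracted from the output with awk. The sketch below assumes that the lines appear exactly as in the sample above, starting with roi.time or roi.joules and using | as separator; roi_time and roi_joules are hypothetical helper names.

```shell
# Read parsecmgmt output on stdin and print the ROI execution time.
roi_time() {
    awk -F'|' '$1 == "roi.time" { print $2 }'
}

# Read parsecmgmt output on stdin and print "<counter type>: <value(s)>".
roi_joules() {
    awk -F'|' '$1 == "roi.joules" { print $2 ": " $3 }'
}

# Sample lines copied from the output format shown above.
sample='roi.time|12.3
roi.joules|TYPE|456.7'

printf '%s\n' "$sample" | roi_time     # prints 12.3
printf '%s\n' "$sample" | roi_joules   # prints TYPE: 456.7
```

In practice the input would come from the benchmark itself, for example by piping the output of the sudo ./parsecmgmt command into one of the two functions.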
Energy measurements are provided through the Mammut library. The meaning of the energy consumption value depends on the type of energy counters available on the running architecture. TYPE can be one of the following:
- CPUS: In this case, 4 TAB-separated values will be printed (e.g. roi.joules|CPUS|400 300 0 20).
    - The first value represents the energy consumed by all the CPUs/Sockets on the machine.
    - The second value represents the energy consumed by only the cores on the CPUs.
    - The third value represents the energy consumed by the DRAM controllers. This counter may not be available on some architectures. In this case, 0 is printed.
    - The fourth value is architecture dependent. In general it represents the energy consumed by the integrated graphics card. This counter may not be available on some architectures. In this case, 0 is printed.

  This counter is available on newer Intel architectures (Silvermont, Broadwell, Haswell, Ivy Bridge, Sandy Bridge, Skylake, Xeon Phi KNL). If you need more detailed measurements (e.g. separating the consumption of individual sockets), please contact us.
- PLUG: In this case only one value will be printed, corresponding to the total energy consumption of the machine (measured at the power plug level). This counter is available on:
    - Architectures using a SmartPower.
    - IBM Power8 machines. This support is still experimental. If you need to use it, please contact us.
If energy counters are not present, only execution time will be printed.
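For the CPUS counter type, the four values can be labelled according to the description above. This is a sketch under the assumption that the line has the form roi.joules|CPUS| followed by four TAB-separated values (spaces are accepted too); cpus_breakdown and the labels it prints are hypothetical names chosen for illustration.

```shell
# Read parsecmgmt output on stdin and label the four CPUS energy values.
# Prints nothing when no roi.joules|CPUS line is present.
cpus_breakdown() {
    awk -F'|' '$1 == "roi.joules" && $2 == "CPUS" {
        split($3, v, /[ \t]+/)
        printf "sockets=%s cores=%s dram=%s other=%s\n", v[1], v[2], v[3], v[4]
    }'
}

# Sample line built from the example values given above.
printf 'roi.joules|CPUS|400\t300\t0\t20\n' | cpus_breakdown
# prints: sockets=400 cores=300 dram=0 other=20
```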
Some applications (e.g. ferret and dedup) have been implemented according to different pattern compositions. To run a version different from the default one, you first need to remove the existing one (if present). To do so, execute:
./parsecmgmt -a fullclean -c gcc-ff -p [BenchmarkName]
./parsecmgmt -a fulluninstall -c gcc-ff -p [BenchmarkName]
To compile and run the other versions, please refer to the following sections.
For dedup, at line 33 of the Makefile, replace encoder_ff_pipeoffarms.o with:
- encoder_ff_farm.o if you want to run the farm version.
- encoder_ff_pipeoffarms.o if you want to run the pipeline of farms version.
- encoder_ff_farmofpipes.o if you want to run the farm of pipelines version.
- encoder_ff_ofarm.o if you want to run the ordered farm version.
After that, build and run dedup as usual.
For ferret, at line 78 of the Makefile, replace ferret-ff-pipeoffarms with:
- ferret-ff-farm if you want to run the farm version.
- ferret-ff-farm-optimized if you want to run the farm (optimized) version.
- ferret-ff-farmofpipes if you want to run the farm of pipelines version.
- ferret-ff-pipeoffarms if you want to run the pipeline of farms version.
After that, build and run ferret as usual.
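The Makefile edits described above for dedup and ferret can also be scripted. The following sketch uses sed to replace a name on a single line and keeps a backup copy; switch_variant is a hypothetical helper, and the location of each Makefile is not stated here, so the path must be supplied by the caller. The demonstration runs on a scratch file, not on a real P3ARSEC Makefile.

```shell
# Replace OLD with NEW on one given line of a file, keeping a .bak copy.
# Arguments: file, line number, OLD (a sed regular expression), NEW.
# OLD and NEW must not contain "/" because it is used as the sed delimiter.
switch_variant() {
    sv_file="$1"; sv_line="$2"; sv_old="$3"; sv_new="$4"
    cp "$sv_file" "$sv_file.bak" &&
    sed "${sv_line}s/${sv_old}/${sv_new}/" "$sv_file.bak" > "$sv_file"
}

# Demonstration on a scratch file standing in for a Makefile.
demo=$(mktemp)
printf 'CXX = g++\nOBJS = encoder_ff_pipeoffarms.o\n' > "$demo"
switch_variant "$demo" 2 'encoder_ff_pipeoffarms\.o' 'encoder_ff_farm.o'
cat "$demo"
rm -f "$demo" "$demo.bak"
```

For dedup the line number would be 33 and for ferret 78, as stated above. Restricting the substitution to one line avoids touching other occurrences of the name elsewhere in the file.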
It is possible to specify requirements on performance (throughput or execution time) and/or power and energy consumption for all the benchmarks. This is done through dynamic reconfiguration of the applications, relying on the Nornir runtime. The runtime will automatically change the number of cores allocated to the application and their clock frequency.
To use this feature, you need to put an XML file (called parameters.xml) in the p3arsec root directory, containing the requirements in terms of performance and power consumption. The XML file must have the following format:
<?xml version="1.0" encoding="UTF-8"?>
<nornirParameters>
    <requirements>
        <throughput>100</throughput>
        <powerConsumption>MIN</powerConsumption>
    </requirements>
</nornirParameters>
In this specific example, we require the application to have a throughput greater than 100 iterations per second. Moreover, since many configurations may provide such throughput, we require Nornir to choose the configuration with the lowest power consumption among those with a feasible throughput. For more details about the type of parameters that can be specified, please refer to the Nornir documentation. The meaning of iteration (i.e. the way in which the throughput is measured) is application-specific. The following table shows what an iteration is for each benchmark application:
Application | Iteration |
---|---|
Blackscholes | 1 Stock Option |
Bodytrack | 1 Frame |
Canneal | 1 Move |
Dedup | 1 Chunk |
Facesim | 1 Frame |
Ferret | 1 Query |
Fluidanimate | 1 Frame |
Freqmine | 1 Call of the FP_growth function |
Raytrace | 1 Frame |
Streamcluster | 1 Evaluation for opening a new center |
Swaptions | 1 Simulation |
Vips | 1 Image Tile |
x264 | 1 Frame |
For example, the XML file shown above would require Blackscholes to process at least 100 Stock Options per second.
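When the throughput requirement changes between experiments, the file can be generated rather than edited by hand. The sketch below writes the same two requirements as the example above (a minimum throughput and MIN power consumption); write_requirements is a hypothetical helper, and the demonstration writes to a scratch file instead of parameters.xml in the p3arsec root directory.

```shell
# Write a Nornir requirements file with the given minimum throughput
# (iterations per second) and minimum power consumption.
# Arguments: throughput, output path.
write_requirements() {
    wr_throughput="$1"; wr_out="$2"
    cat > "$wr_out" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<nornirParameters>
    <requirements>
        <throughput>${wr_throughput}</throughput>
        <powerConsumption>MIN</powerConsumption>
    </requirements>
</nornirParameters>
EOF
}

# Demonstration: same content as the example above, in a scratch file.
req=$(mktemp)
write_requirements 100 "$req"
cat "$req"
rm -f "$req"
```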
If you want to compile/run applications with dynamic reconfiguration enabled, use the following configurations (to be specified through the -c parameter):
- gcc-ff-nornir for the FastFlow implementation.
- gcc-pthreads-nornir for the Pthreads implementation.
- gcc-openmp-nornir for the OpenMP implementation.
- gcc-tbb-nornir for the Intel TBB implementation.
ATTENTION: Running the gcc-*-nornir configurations requires sudo rights, since some high-privilege operations are needed, such as reading the power consumption and dynamically scaling the clock frequency.
The structure and modelling of the applications is described in the paper:
@article{10.1145/3132710,
author = {De Sensi, Daniele and De Matteis, Tiziano and Torquati, Massimo and Mencagli, Gabriele and Danelutto, Marco},
title = {Bringing Parallel Patterns Out of the Corner: The P3 ARSEC Benchmark Suite},
year = {2017},
issue_date = {December 2017},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {14},
number = {4},
issn = {1544-3566},
url = {https://doi.org/10.1145/3132710},
doi = {10.1145/3132710},
journal = {ACM Trans. Archit. Code Optim.},
month = {oct},
articleno = {33},
numpages = {26},
keywords = {multicore programming, parsec, Parallel patterns, algorithmic skeletons, benchmarking}
}
Release v1.0 was used in the paper.
P3ARSEC has been developed by [Daniele De Sensi](mailto:ddesensi@ethz.ch) and Tiziano De Matteis.