================================================================================
CAP Bench
A benchmark suite for performance and energy evaluation of low-power many-core processors
================================================================================
CAP Bench was born from a joint initiative involving three research groups:
* Computer Architecture and Parallel Processing Team (CArT)
- Pontifical Catholic University of Minas Gerais (PUC Minas)
- https://www.cart-research.com/home/cpu-memory-noc
* Parallel and Distributed Processing Group (GPPD)
- Federal University of Rio Grande do Sul (UFRGS)
* Nanosimulations and Embedded Applications
for Hybrid Multi-core Architectures (Nanosim)
- University of Grenoble.
================================================================================
How to cite:
Souza, M. A., Penna, P. H., Queiroz, M. M., Pereira, A. D., Góes, L. F. W.,
Freitas, H. C., Castro, M., Navaux, P. O. A., and Méhaut, J.-F. (2017)
CAP Bench: a benchmark suite for performance and energy evaluation of low-power
many-core processors. Concurrency Computat.: Pract. Exper., 29: e3892,
doi: 10.1002/cpe.3892.
================================================================================
The benchmark comprises seven kernels, chosen to cover a wide range of
characteristics.
The kernels follow five different parallel patterns:
* Divide and Conquer, which starts with a division of the problem into small
subproblems solvable in parallel, ending in a merging of the subsolutions
into a result;
* Map, where operations are mostly uniform and applied individually over
elements of a data structure (see the sketch after this list);
* MapReduce, which combines Map and a consolidation of results in a Reduce
procedure;
* Workpool, where the algorithm can be divided into independent tasks and
distributed among the execution units in a balanced way;
* Stencil, in which a function can access an element in a collection and its
neighbors, given by relative offsets.
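As an illustration, the sketch below (not taken from the CAP Bench
sources) shows the Map pattern with OpenMP: a uniform operation is
applied independently to every element of an array, so the iterations
can run in parallel.

    #include <omp.h>

    /* Map pattern: the same operation is applied independently to
     * every element of the array, so iterations may run in parallel. */
    void map_square(double *v, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            v[i] = v[i] * v[i];
    }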
=======
KERNELS
=======
1) Features from Accelerated Segment Test - FAST
Features from Accelerated Segment Test (FAST) is a corner detection
method that follows the Stencil parallel pattern. It is usually used to
extract feature points and to track and map objects in computer vision
tasks. It uses a circle of 16 pixels to classify whether a candidate
point p is actually a corner. Each pixel in the circle is labeled from
1 to 16, clockwise. If N contiguous pixels in the circle are all
brighter than the intensity of the candidate pixel p plus a threshold
t, or all darker than the intensity of p minus t, then p is classified
as a corner.
In the MPPA-256 implementation, we use randomly generated images as
inputs and a mask containing the positions relative to p that must be
analyzed. The N value is set to 12 and threshold t is set to 20. Due to
very large input images and the compute cluster memory restriction, the
I/O subsystem partitions the input image into 256 KB chunks and sends
them to compute clusters for corner detection. We consider this a
medium granularity, which provokes intense communication and makes the
kernel NoC-bound. After that, output chunks are sent back to the I/O
subsystem,
which in turn puts them together to build the output image that indicates
all corners present in it. Moreover, the I/O subsystem receives the
number of corners detected by each compute cluster and sums these
counts to obtain the overall number of corners. Since the counts may
differ from chunk to chunk, the task load of the algorithm is
irregular.
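To make the segment test concrete, the sketch below (a simplified
illustration, not the CAP Bench code) checks a single candidate pixel
of an 8-bit grayscale image; the circle offsets, the function name and
its parameters are illustrative. The caller must keep the candidate at
least 3 pixels away from the image border.

    /* Bresenham circle of radius 3 around the candidate pixel, as
     * (dx, dy) offsets, labeled 1 to 16 clockwise from the top. */
    static const int circle_dx[16] =
        { 0, 1, 2, 3, 3, 3, 2, 1, 0, -1, -2, -3, -3, -3, -2, -1 };
    static const int circle_dy[16] =
        { -3, -3, -2, -1, 0, 1, 2, 3, 3, 3, 2, 1, 0, -1, -2, -3 };

    /* Returns 1 if at least n contiguous circle pixels are all brighter
     * than p + t or all darker than p - t (the FAST segment test). */
    int is_corner(const unsigned char *img, int width, int x, int y,
                  int n, int t)
    {
        int p = img[y * width + x];
        int bright = 0, dark = 0;

        /* Walk the circle twice so contiguous runs may wrap around. */
        for (int i = 0; i < 32; i++) {
            int q = img[(y + circle_dy[i % 16]) * width
                        + (x + circle_dx[i % 16])];
            bright = (q > p + t) ? bright + 1 : 0;
            dark   = (q < p - t) ? dark + 1 : 0;
            if (bright >= n || dark >= n)
                return 1;
        }
        return 0;
    }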
2) Friendly Numbers - FN
In number theory, two natural numbers are friendly if they share the same
abundance. For a given number n, its abundance is defined as the ratio
between the sum of the divisors of n and n itself. The parallel pattern
used in the FN implementation is MapReduce.
The MPPA-256 implementation uses a master/slave approach. Since every
processing task for FN can be executed independently, we split the
number interval into sets of equal size in the master process and
distribute them among the compute clusters to be processed
simultaneously. These sets are balanced, yielding a regular task load
in the slave processes. The abundance results are sent back to the
master process, which then performs the abundance comparisons using the
4 I/O clusters. This computation is not dominated by NoC usage or
memory accesses; it is exclusively CPU-bound.
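The per-number work amounts to computing the abundance. A minimal
sketch follows, with an O(sqrt(n)) divisor sum; the function name and
types are illustrative, not the CAP Bench API.

    /* Abundance of n: (sum of the divisors of n) / n.  Divisors are
     * found in pairs (i, n / i) up to sqrt(n). */
    double abundance(unsigned long n)
    {
        unsigned long sum = 0;
        for (unsigned long i = 1; i * i <= n; i++) {
            if (n % i == 0) {
                sum += i;
                if (i != n / i)   /* do not count a square root twice */
                    sum += n / i;
            }
        }
        return (double)sum / (double)n;
    }

Two numbers m and n are then friendly when abundance(m) equals
abundance(n); an exact implementation would compare the reduced
fractions rather than floating-point values.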
3) Gaussian Filter - GF
The Gaussian blur filter is an image smoothing filter that seeks to
reduce noise and achieve an overall smoothing of the image. It
consists of applying a specially computed two-dimensional Gaussian mask
to an
image, using a matrix convolution operation. It uses the Stencil
parallel pattern.
In the MPPA-256 implementation, we use randomly generated masks and
images as inputs. Since some input images are very large and the compute
clusters have a 2 MB memory restriction, the I/O subsystem partitions the
image into 1 MB chunks and sends them to compute clusters to be filtered.
This is a reasonable chunk size, inducing moderate NoC usage that does
not overwhelm the overall computation. Moreover, the task load is
always constant. After the individual chunks are filtered, they are
sent back to the I/O subsystem, which puts them together to build the
output image.
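The per-chunk work is a plain 2-D convolution. The sketch below is a
simplified illustration with illustrative names; border pixels and the
ghost rows a real chunked implementation needs are left out.

    /* Convolve the interior of a (height x width) grayscale chunk
     * with a (2k+1) x (2k+1) Gaussian mask (row-major). */
    void gaussian_filter(const unsigned char *in, unsigned char *out,
                         int height, int width,
                         const double *mask, int k)
    {
        int m = 2 * k + 1;

        for (int y = k; y < height - k; y++) {
            for (int x = k; x < width - k; x++) {
                double acc = 0.0;
                for (int i = -k; i <= k; i++)
                    for (int j = -k; j <= k; j++)
                        acc += mask[(i + k) * m + (j + k)]
                             * in[(y + i) * width + (x + j)];
                out[y * width + x] = (unsigned char)acc;
            }
        }
    }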
4) Integer Sort - IS
The integer sort problem consists of sorting a very large amount of
integer numbers. An implementation of integer sort can use the bucket
sort approach, which divides elements into buckets. A bucket is a
structure that stores numbers in a certain range. The integer numbers
used as input are randomly generated and range from 0 to 2^20 - 1. The IS
kernel uses the Divide and Conquer parallel pattern.
In the MPPA-256 implementation, the buckets are further subdivided into
minibuckets of 1 MB. As input elements are mapped to the appropriate
buckets, they are placed in a minibucket. When the minibucket becomes
full, a new minibucket is allocated inside its parent bucket and starts
receiving elements. This takes place in the I/O subsystem, which also
sends minibuckets to the compute clusters to work on. Each compute
cluster receives only one minibucket at a time, due to memory
restrictions. Inside a compute cluster, minibuckets are sorted using a
parallel mergesort algorithm; since the initial order is random, the
task load is irregular. Sorted minibuckets are then sent back to the
I/O subsystem to be merged, using Pthreads to take advantage of its 4
cores. Because of this constant flow of minibuckets, IS is highly
communication-intensive and is bound by the NoC.
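The mapping of keys to buckets is the cheap part of the kernel; the
sketch below shows it for equal-width buckets over the 0 to 2^20 - 1
key range. The bucket count and all names are illustrative, and
minibucket management is elided.

    #define MAX_KEY  (1 << 20)   /* keys range from 0 to 2^20 - 1 */
    #define NBUCKETS 256         /* illustrative bucket count     */

    /* Each bucket covers an equal-width slice of the key range,
     * so mapping a key to its bucket is O(1). */
    static inline int bucket_of(int key)
    {
        return key / (MAX_KEY / NBUCKETS);
    }

    /* Scatter phase: place each key in its bucket.  In the MPPA-256
     * version a bucket is a list of 1 MB minibuckets; a new minibucket
     * is allocated whenever the current one fills up. */
    void scatter(const int *keys, int nkeys, int **buckets, int *counts)
    {
        for (int i = 0; i < nkeys; i++) {
            int b = bucket_of(keys[i]);
            buckets[b][counts[b]++] = keys[i];  /* capacity assumed */
        }
    }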
5) K-Means - KM
K-Means clustering is a data clustering solution employed in clustering
analysis. We opted to use Lloyd's algorithm in our work. Given a set of
n points in a real d-dimensional space, the problem is to partition
these n points into k partitions. The data points are evenly and
randomly distributed among the k partitions, and the initial centroids
are computed. Then, the data points are re-clustered into partitions
taking into account the minimum Euclidean distance between them and the
centroids. Next, the centroid of each partition is recalculated taking
the mean of all points in the partition. The whole procedure is repeated
until no centroid moves farther than a minimum accepted distance.
During the execution, the number of points within
each partition may differ, implying different recalculation times for
each partition’s centroid. The parallel pattern of KM is Map.
The MPPA-256 version of the K-Means algorithm takes additional parameters
p, specifying the number of compute clusters to be used, and t, which
specifies the number of working threads per cluster. Each cluster
spawns t threads, so the total number of execution flows equals p * t.
This strategy provokes intense communication between the I/O
subsystem and the compute clusters. We first distribute data points and
replicate data centroids among clusters, and then loop over a two-phase
iteration. First, partitions are populated. Then, data centroids are
recalculated, a memory-intensive process. For this recalculation, each
cluster uses its local data points to compute partial centroids, i.e.,
a partial sum of data points and population within a partition. Next,
clusters exchange partial centroids so that each cluster ends up with
all the partial centroids of the same partitions. The amount of work
for each thread may vary during each iteration, an irregular task load
characteristic. Finally, clusters compute their local centroids and
send them to the master process.
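The first phase of each iteration is sketched below for one cluster's
local points: every point is assigned to the nearest centroid while
partial sums and populations are accumulated. The partial-centroid
exchange over the NoC is elided, and all names are illustrative.

    #include <float.h>
    #include <string.h>

    /* Assign each of the n local d-dimensional points (row-major) to
     * the nearest of the k centroids; squared Euclidean distance is
     * enough for the comparison.  partial_sum (k x d) and population
     * (k) hold the partial centroids later exchanged among clusters. */
    void assign_points(const double *points, int n, int d,
                       const double *centroids, int k,
                       double *partial_sum, int *population, int *label)
    {
        memset(partial_sum, 0, (size_t)k * d * sizeof(double));
        memset(population, 0, (size_t)k * sizeof(int));

        for (int i = 0; i < n; i++) {
            int best = 0;
            double best_dist = DBL_MAX;

            for (int c = 0; c < k; c++) {
                double dist = 0.0;
                for (int j = 0; j < d; j++) {
                    double diff = points[i * d + j]
                                - centroids[c * d + j];
                    dist += diff * diff;
                }
                if (dist < best_dist) {
                    best_dist = dist;
                    best = c;
                }
            }
            label[i] = best;
            population[best]++;
            for (int j = 0; j < d; j++)
                partial_sum[best * d + j] += points[i * d + j];
        }
    }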
6) LU Factorization - LU
The LU algorithm is a matrix decomposition algorithm that factors a
matrix A as a product of two triangular matrices: lower (L) and upper
(U). To compute the L and U matrices -- we opted to implement Gaussian
elimination -- n-1 iterations are required, where n is the order of A
(A is always square). In each iteration, a sub-matrix of a smaller
order (1 unit smaller) is analyzed, starting from order n. First, the
largest
element in the submatrix (pivot) is found. Then, this element is moved
to the (1,1) position, shifting rows and columns appropriately. The line
containing the pivot is subsequently divided by it, so that the pivot
element becomes 1. The last step (reduction) aims to nullify elements
below the main diagonal. For every line l below the pivot, we multiply
the pivot line p by the negation of the first element e of l and add
the result to l, replacing l with l - e*p. We store -e in a separate
matrix. After the iterations are finished, every line has undergone
reduction, and the resulting
matrix is the U matrix. The L matrix is formed by the -e factors that
were stored in the separate matrix. Therefore, both matrices are computed
simultaneously. The LU kernel uses the Workpool parallel pattern.
We implemented the LU factorization algorithm for the MPPA-256 by
assigning rows to compute clusters so that the largest element (the
pivot) can be found. Each compute cluster receives no more than 1 MB
of data at a time, which causes intense communication. On the other
hand, the distributed task loads are regular. The same 1 MB restriction
applies when distributing lines among the compute clusters for the
reduction step.
Swapping rows so the pivot becomes the first element in the matrix is
done exclusively in the master process (I/O subsystem). The I/O subsystem
is also used to rebuild the L and U matrices from chunks processed in the
compute clusters.
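One reduction step, as described above, is sketched below for a local
matrix; it assumes the pivot row has already been swapped into place
and divided by the pivot, and all names are illustrative. Pivot search
and row swapping, done in the master process, are elided.

    /* Nullify the elements below row piv of an n x n row-major matrix
     * a, assuming a[piv][piv] has already been normalized to 1.  The
     * -e factors are stored in l, which accumulates the L matrix. */
    void reduce_below(double *a, double *l, int n, int piv)
    {
        for (int row = piv + 1; row < n; row++) {
            double e = a[row * n + piv]; /* first element of the row */

            l[row * n + piv] = -e;       /* factor kept for L        */
            for (int col = piv; col < n; col++)
                a[row * n + col] += -e * a[piv * n + col];
        }
    }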
7) Traveling-Salesman Problem - TSP
The TSP consists of solving the routing problem of a hypothetical
traveling salesman: the route must pass through n towns, visiting each
town only once, return to the town of origin, and have the shortest
possible length. We opted for a brute-force branch-and-bound approach
based on a simple heuristic. It takes as input the number of towns
and a cost matrix, and outputs the minimum path length. The algorithm
does a depth-first search looking for the shortest path, pruning paths
that have a higher cost than the current minimum cost. This pruning
introduces irregularities in the algorithm, since the search depth needed
to discard one of the branches depends on the order in which the branches
were searched. Our implementation of TSP uses the Workpool parallel
pattern.
The MPPA-256 implementation uses a task queue in which tasks are branches
of the search tree. Compute clusters take jobs from it to be executed.
The number of clusters and the number of threads per cluster define
the total number of execution flows: each cluster spawns n_threads
threads, totaling n_threads * n_clusters threads. When the minimum path
is updated, the update is broadcast to every other cluster so they can
also use it to prune their searches. At the end of the execution, one
of the clusters (typically the 0-th) prints the solution. The final
solution might have been discovered by any one of the clusters;
however, all of them are aware of it because every newly discovered
minimum path is broadcast.
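The sequential core of the search is sketched below; the distributed
task queue and the broadcast of new minima are elided, and all names
(ncities, cost, best_len) are illustrative.

    #include <limits.h>

    int ncities;             /* number of towns                     */
    int cost[16][16];        /* cost matrix (illustrative bound)    */
    int best_len = INT_MAX;  /* shortest complete tour found so far */

    /* Depth-first search with pruning: a branch is abandoned as soon
     * as its partial length reaches the current minimum.  visited is
     * a bit mask of the towns already on the path. */
    void search(int town, int visited, int len, int depth)
    {
        if (len >= best_len)    /* prune: cannot beat the minimum */
            return;

        if (depth == ncities) { /* complete tour: close the cycle */
            int total = len + cost[town][0];
            if (total < best_len)
                best_len = total; /* the MPPA-256 version would
                                   * broadcast this new minimum  */
            return;
        }

        for (int next = 1; next < ncities; next++)
            if (!(visited & (1 << next)))
                search(next, visited | (1 << next),
                       len + cost[town][next], depth + 1);
    }

Starting from town 0, the whole tree is explored with
search(0, 1, 0, 1).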