================================================================================
CAP Bench
A benchmark suite for performance and energy evaluation of low-power many-core processors
================================================================================
CAP Bench was born from a joint initiative involving three research groups:
* Computer Architecture and Parallel Processing Team (CArT)
- Pontifical Catholic University of Minas Gerais (PUC Minas)
- https://www.cart-research.com/home/cpu-memory-noc
* Parallel and Distributed Processing Group (GPPD)
- Federal University of Rio Grande do Sul (UFRGS)
* Nanosimulations and Embedded Applications
for Hybrid Multi-core Architectures (Nanosim)
- University of Grenoble.
================================================================================
How to cite:
Souza, M. A., Penna, P. H., Queiroz, M. M., Pereira, A. D., Góes, L. F. W.,
Freitas, H. C., Castro, M., Navaux, P. O. A., and Méhaut, J.-F. (2017)
CAP Bench: a benchmark suite for performance and energy evaluation of low-power
many-core processors. Concurrency Computat.: Pract. Exper., 29: e3892,
doi: 10.1002/cpe.3892.
================================================================================
The benchmark comprises seven kernels, chosen to cover a wide range of
characteristics.
The kernels follow five different parallel patterns:
* Divide and Conquer, which starts with a division of the problem into small
subproblems solvable in parallel, ending in a merging of the subsolutions
into a result;
* Map, where operations are mostly uniform and applied individually over
elements of a data structure (see the sketch after this list);
* MapReduce, which combines Map and a consolidation of results in a Reduce
procedure;
* Workpool, where the algorithm can be divided into independent tasks and
distributed among the execution units in a balanced way;
* Stencil, in which a function can access an element in a collection and its
neighbors, given by relative offsets.
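As an illustration, the sketch below (not taken from the CAP Bench
sources) shows the Map pattern with OpenMP: a uniform operation is
applied independently to every element of an array, so the iterations
can run in parallel.

    #include <omp.h>

    /* Map pattern: the same operation is applied independently to
     * every element of the array, so iterations may run in parallel. */
    void map_square(double *v, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            v[i] = v[i] * v[i];
    }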
=======
KERNELS
=======
1) Features from Accelerated Segment Test - FAST
Features from Accelerated Segment Test (FAST) is a corner detection
method that follows the Stencil parallel pattern. It is usually used to
extract feature points and to track and map objects in computer vision
tasks. It uses a circle of 16 pixels to classify whether a candidate
point p is actually a corner. Each pixel in the circle is labeled from
1 to 16, clockwise. If N contiguous pixels in the circle are all
brighter than the intensity of the candidate pixel p plus a threshold
t, or all darker than the intensity of p minus t, then p is classified
as a corner.
In the MPPA-256 implementation, we use randomly generated images as
inputs and a mask containing the positions relative to p that must be
analyzed. The N value is set to 12 and threshold t is set to 20. Due to
very large input images and the compute cluster memory restriction, the
I/O subsystem partitions the input image into 256 KB chunks and sends
them to compute clusters for corner detection. We consider this a
medium granularity, which provokes intense communication and makes the
kernel NoC-bound. After that, output chunks are sent back to the I/O
subsystem,
which in turn puts them together to build the output image that indicates
all corners present in it. Moreover, the I/O subsystem receives the
number of corners detected by each compute cluster and sums these
counts to obtain the overall number of corners. Since the counts may
differ from chunk to chunk, the task load of the algorithm is
irregular.
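To make the segment test concrete, the sketch below (a simplified
illustration, not the CAP Bench code) checks a single candidate pixel
of an 8-bit grayscale image; the circle offsets, the function name and
its parameters are illustrative. The caller must keep the candidate at
least 3 pixels away from the image border.

    /* Bresenham circle of radius 3 around the candidate pixel, as
     * (dx, dy) offsets, labeled 1 to 16 clockwise from the top. */
    static const int circle_dx[16] =
        { 0, 1, 2, 3, 3, 3, 2, 1, 0, -1, -2, -3, -3, -3, -2, -1 };
    static const int circle_dy[16] =
        { -3, -3, -2, -1, 0, 1, 2, 3, 3, 3, 2, 1, 0, -1, -2, -3 };

    /* Returns 1 if at least n contiguous circle pixels are all brighter
     * than p + t or all darker than p - t (the FAST segment test). */
    int is_corner(const unsigned char *img, int width, int x, int y,
                  int n, int t)
    {
        int p = img[y * width + x];
        int bright = 0, dark = 0;

        /* Walk the circle twice so contiguous runs may wrap around. */
        for (int i = 0; i < 32; i++) {
            int q = img[(y + circle_dy[i % 16]) * width
                        + (x + circle_dx[i % 16])];
            bright = (q > p + t) ? bright + 1 : 0;
            dark   = (q < p - t) ? dark + 1 : 0;
            if (bright >= n || dark >= n)
                return 1;
        }
        return 0;
    }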
2) Friendly Numbers - FN
In number theory, two natural numbers are friendly if they share the same
abundance. For a given number n, its abundance is defined as the ratio
between the sum of the divisors of n and n itself. The parallel pattern
used in the FN implementation is MapReduce.
The MPPA-256 implementation uses a master/slave approach. Since every
processing task for FN can be executed independently, we split the
number interval into sets of equal size in the master process and
distribute them among the compute clusters to be processed
simultaneously. These sets are balanced, yielding a regular task load
in the slave processes. The abundance results are sent back to the
master process, which then performs the abundance comparisons using the
4 I/O clusters. This computation is not dominated by NoC usage or
memory accesses; it is exclusively CPU-bound.
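The per-number work amounts to computing the abundance. A minimal
sketch follows, with an O(sqrt(n)) divisor sum; the function name and
types are illustrative, not the CAP Bench API.

    /* Abundance of n: (sum of the divisors of n) / n.  Divisors are
     * found in pairs (i, n / i) up to sqrt(n). */
    double abundance(unsigned long n)
    {
        unsigned long sum = 0;
        for (unsigned long i = 1; i * i <= n; i++) {
            if (n % i == 0) {
                sum += i;
                if (i != n / i)   /* do not count a square root twice */
                    sum += n / i;
            }
        }
        return (double)sum / (double)n;
    }

Two numbers m and n are then friendly when abundance(m) equals
abundance(n); an exact implementation would compare the reduced
fractions rather than floating-point values.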
3) Gaussian Filter - GF
The Gaussian blur filter is an image smoothing filter that seeks to
reduce noise and achieve an overall smoothing of the image. It
consists of applying a specially computed two-dimensional Gaussian mask
to an
image, using a matrix convolution operation. It uses the Stencil
parallel pattern.
In the MPPA-256 implementation, we use randomly generated masks and
images as inputs. Since some input images are very large and the compute
clusters have a 2 MB memory restriction, the I/O subsystem partitions the
image into 1 MB chunks and sends them to compute clusters to be filtered.
This is a reasonable chunk size, inducing moderate NoC usage that does
not overwhelm the overall computation. Moreover, the task load is
always constant. After the individual chunks are filtered, they are
sent back to the I/O subsystem, which puts them together to build the
output image.
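The per-chunk work is a plain 2-D convolution. The sketch below is a
simplified illustration with illustrative names; border pixels and the
ghost rows a real chunked implementation needs are left out.

    /* Convolve the interior of a (height x width) grayscale chunk
     * with a (2k+1) x (2k+1) Gaussian mask (row-major). */
    void gaussian_filter(const unsigned char *in, unsigned char *out,
                         int height, int width,
                         const double *mask, int k)
    {
        int m = 2 * k + 1;

        for (int y = k; y < height - k; y++) {
            for (int x = k; x < width - k; x++) {
                double acc = 0.0;
                for (int i = -k; i <= k; i++)
                    for (int j = -k; j <= k; j++)
                        acc += mask[(i + k) * m + (j + k)]
                             * in[(y + i) * width + (x + j)];
                out[y * width + x] = (unsigned char)acc;
            }
        }
    }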
4) Integer Sort - IS
The integer sort problem consists of sorting a very large amount of
integer numbers. An implementation of integer sort can use the bucket
sort approach, which divides elements into buckets. A bucket is a
structure that stores numbers in a certain range. The integer numbers
used as input are randomly generated and range from 0 to 2^20 - 1. The IS
kernel uses the Divide and Conquer parallel pattern.
In the MPPA-256 implementation, the buckets are further subdivided into
minibuckets of 1 MB. As input elements are mapped to the appropriate
buckets, they are placed in a minibucket. When the minibucket becomes
full, a new minibucket is allocated inside its parent bucket and starts
receiving elements. This takes place in the I/O subsystem, which also
sends minibuckets to the compute clusters to work on. Each compute
cluster receives only one minibucket at a time, due to memory
restrictions. Inside a compute cluster, minibuckets are sorted using a
parallel mergesort algorithm; since the initial order is random, the
task load is irregular. Sorted minibuckets are then sent back to the
I/O subsystem to be merged, using Pthreads to take advantage of its 4
cores. Because of this constant flow of minibuckets, IS is highly
communication-intensive and is bound by the NoC.
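The mapping of keys to buckets is the cheap part of the kernel; the
sketch below shows it for equal-width buckets over the 0 to 2^20 - 1
key range. The bucket count and all names are illustrative, and
minibucket management is elided.

    #define MAX_KEY  (1 << 20)   /* keys range from 0 to 2^20 - 1 */
    #define NBUCKETS 256         /* illustrative bucket count     */

    /* Each bucket covers an equal-width slice of the key range,
     * so mapping a key to its bucket is O(1). */
    static inline int bucket_of(int key)
    {
        return key / (MAX_KEY / NBUCKETS);
    }

    /* Scatter phase: place each key in its bucket.  In the MPPA-256
     * version a bucket is a list of 1 MB minibuckets; a new minibucket
     * is allocated whenever the current one fills up. */
    void scatter(const int *keys, int nkeys, int **buckets, int *counts)
    {
        for (int i = 0; i < nkeys; i++) {
            int b = bucket_of(keys[i]);
            buckets[b][counts[b]++] = keys[i];  /* capacity assumed */
        }
    }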
5) K-Means - KM
K-Means clustering is a data clustering solution employed in clustering
analysis. We opted to use Lloyd's algorithm in our work. Given a set of
n points in a real d-dimensional space, the problem is to partition
these n points into k partitions. The data points are evenly and
randomly distributed among the k partitions, and the initial centroids
are computed. Then, the data points are re-clustered into partitions
taking into account the minimum Euclidean distance between them and the
centroids. Next, the centroid of each partition is recalculated taking
the mean of all points in the partition. The whole procedure is repeated
until no centroid moves farther than a minimum accepted distance.
During the execution, the number of points within
each partition may differ, implying different recalculation times for
each partition’s centroid. The parallel pattern of KM is Map.
The MPPA-256 version of the K-Means algorithm takes additional parameters
p, specifying the number of compute clusters to be used, and t, which
specifies the number of working threads per cluster. Each cluster
spawns t threads, so the total number of execution flows equals p * t.
This strategy provokes intense communication between the I/O
subsystem and the compute clusters. We first distribute data points and
replicate data centroids among clusters, and then loop over a two-phase
iteration. First, partitions are populated. Then, data centroids are
recalculated, a memory-intensive process. For this recalculation, each
cluster uses its local data points to compute partial centroids, i.e.,
a partial sum of data points and population within a partition. Next,
clusters exchange partial centroids so that each cluster ends up with
all the partial centroids of the same partitions. The amount of work
for each thread may vary during each iteration, an irregular task load
characteristic. Finally, clusters compute their local centroids and
send them to the master process.
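The first phase of each iteration is sketched below for one cluster's
local points: every point is assigned to the nearest centroid while
partial sums and populations are accumulated. The partial-centroid
exchange over the NoC is elided, and all names are illustrative.

    #include <float.h>
    #include <string.h>

    /* Assign each of the n local d-dimensional points (row-major) to
     * the nearest of the k centroids; squared Euclidean distance is
     * enough for the comparison.  partial_sum (k x d) and population
     * (k) hold the partial centroids later exchanged among clusters. */
    void assign_points(const double *points, int n, int d,
                       const double *centroids, int k,
                       double *partial_sum, int *population, int *label)
    {
        memset(partial_sum, 0, (size_t)k * d * sizeof(double));
        memset(population, 0, (size_t)k * sizeof(int));

        for (int i = 0; i < n; i++) {
            int best = 0;
            double best_dist = DBL_MAX;

            for (int c = 0; c < k; c++) {
                double dist = 0.0;
                for (int j = 0; j < d; j++) {
                    double diff = points[i * d + j]
                                - centroids[c * d + j];
                    dist += diff * diff;
                }
                if (dist < best_dist) {
                    best_dist = dist;
                    best = c;
                }
            }
            label[i] = best;
            population[best]++;
            for (int j = 0; j < d; j++)
                partial_sum[best * d + j] += points[i * d + j];
        }
    }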
6) LU Factorization - LU
The LU algorithm is a matrix decomposition algorithm that factors a
matrix A as a product of two triangular matrices: lower (L) and upper
(U). To compute the L and U matrices -- we opted to implement Gaussian
elimination -- n-1 iterations are required, where n is the order of A
(A is always square). In each iteration, a sub-matrix of a smaller
order (1 unit smaller) is analyzed, starting from order n. First, the
largest
element in the submatrix (pivot) is found. Then, this element is moved
to the (1,1) position, shifting rows and columns appropriately. The line
containing the pivot is subsequently divided by it, so that the pivot
element becomes 1. The last step (reduction) aims to nullify elements
below the main diagonal. For every line l below the pivot, we multiply
the pivot line p by the negation of the first element e of l and add
the result to l, replacing l with l - e*p. We store -e in a separate
matrix. After the iterations are finished, every line has undergone
reduction, and the resulting
matrix is the U matrix. The L matrix is formed by the -e factors that
were stored in the separate matrix. Therefore, both matrices are computed
simultaneously. The LU kernel uses the Workpool parallel pattern.
We implemented the LU factorization algorithm for the MPPA-256 by
assigning rows to compute clusters so that the largest element (the
pivot) can be found. Each compute cluster receives no more than 1 MB
of data at a time, which causes intense communication. On the other
hand, the distributed task loads are regular. The same 1 MB restriction
applies when distributing lines among the compute clusters for the
reduction step.
Swapping rows so the pivot becomes the first element in the matrix is
done exclusively in the master process (I/O subsystem). The I/O subsystem
is also used to rebuild the L and U matrices from chunks processed in the
compute clusters.
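One reduction step, as described above, is sketched below for a local
matrix; it assumes the pivot row has already been swapped into place
and divided by the pivot, and all names are illustrative. Pivot search
and row swapping, done in the master process, are elided.

    /* Nullify the elements below row piv of an n x n row-major matrix
     * a, assuming a[piv][piv] has already been normalized to 1.  The
     * -e factors are stored in l, which accumulates the L matrix. */
    void reduce_below(double *a, double *l, int n, int piv)
    {
        for (int row = piv + 1; row < n; row++) {
            double e = a[row * n + piv]; /* first element of the row */

            l[row * n + piv] = -e;       /* factor kept for L        */
            for (int col = piv; col < n; col++)
                a[row * n + col] += -e * a[piv * n + col];
        }
    }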
7) Traveling-Salesman Problem - TSP
The TSP consists of solving the routing problem of a hypothetical
traveling salesman: the route must pass through n towns, visiting each
town only once, return to the town of origin, and have the shortest
possible length. We opted for a brute-force branch-and-bound approach
based on a simple heuristic. It takes as input the number of towns
and a cost matrix, and outputs the minimum path length. The algorithm
does a depth-first search looking for the shortest path, pruning paths
that have a higher cost than the current minimum cost. This pruning
introduces irregularities in the algorithm, since the search depth needed
to discard one of the branches depends on the order in which the branches
were searched. Our implementation of TSP uses the Workpool parallel
pattern.
The MPPA-256 implementation uses a task queue in which tasks are branches
of the search tree. Compute clusters take jobs from it to be executed.
The number of clusters and the number of threads per cluster define
the total number of execution flows: each cluster spawns n_threads
threads, totaling n_threads * n_clusters threads. When the minimum path
is updated, the update is broadcast to every other cluster so they can
also use it to prune their searches. At the end of the execution, one
of the clusters (typically the 0-th) prints the solution. The final
solution might have been discovered by any one of the clusters;
however, all of them are aware of it because every newly discovered
minimum path is broadcast.
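The sequential core of the search is sketched below; the distributed
task queue and the broadcast of new minima are elided, and all names
(ncities, cost, best_len) are illustrative.

    #include <limits.h>

    int ncities;             /* number of towns                     */
    int cost[16][16];        /* cost matrix (illustrative bound)    */
    int best_len = INT_MAX;  /* shortest complete tour found so far */

    /* Depth-first search with pruning: a branch is abandoned as soon
     * as its partial length reaches the current minimum.  visited is
     * a bit mask of the towns already on the path. */
    void search(int town, int visited, int len, int depth)
    {
        if (len >= best_len)    /* prune: cannot beat the minimum */
            return;

        if (depth == ncities) { /* complete tour: close the cycle */
            int total = len + cost[town][0];
            if (total < best_len)
                best_len = total; /* the MPPA-256 version would
                                   * broadcast this new minimum  */
            return;
        }

        for (int next = 1; next < ncities; next++)
            if (!(visited & (1 << next)))
                search(next, visited | (1 << next),
                       len + cost[town][next], depth + 1);
    }

Starting from town 0, the whole tree is explored with
search(0, 1, 0, 1).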