================================================================================
                                   CAP Bench
     A benchmark suite for performance and energy evaluation of low-power
                             many-core processors
================================================================================

CAP Bench was born from a joint initiative involving three research groups:

* Computer Architecture and Parallel Processing Team (CArT) - Pontifical
  Catholic University of Minas Gerais (PUC Minas) -
  https://www.cart-research.com/home/cpu-memory-noc
* Parallel and Distributed Processing Group (GPPD) - Federal University of Rio
  Grande do Sul (UFRGS)
* Nanosimulations and Embedded Applications for Hybrid Multi-core
  Architectures (Nanosim) - University of Grenoble

================================================================================
How to cite:

Souza, M. A., Penna, P. H., Queiroz, M. M., Pereira, A. D., Góes, L. F. W.,
Freitas, H. C., Castro, M., Navaux, P. O. A., and Méhaut, J.-F. (2017) CAP
Bench: a benchmark suite for performance and energy evaluation of low-power
many-core processors. Concurrency Computat.: Pract. Exper., 29: e3892,
doi: 10.1002/cpe.3892.

================================================================================

To compose the benchmark, seven algorithms are proposed, aiming to cover a
wide range of characteristics. The kernels follow five different parallel
patterns:

* Divide and Conquer, which starts with a division of the problem into small
  subproblems solvable in parallel and ends with a merging of the subsolutions
  into a result;
* Map, where operations are mostly uniform and applied individually over the
  elements of a data structure;
* MapReduce, which combines a Map with a consolidation of results in a Reduce
  procedure;
* Workpool, where the algorithm can be divided into independent tasks and
  distributed among the execution units in a balanced way;
* Stencil, in which a function can access an element in a collection and its
  neighbors, given by relative offsets.

=======
KERNELS
=======

1) Features from Accelerated Segment Test - FAST

Features from Accelerated Segment Test (FAST) is a corner detection method
that follows the Stencil parallel pattern. It is usually used to extract
feature points and to track and map objects in computer vision tasks. It uses
a circle of 16 pixels to classify whether a candidate point p is actually a
corner. Each pixel in the circle is labeled from 1 to 16, clockwise. If N
contiguous pixels in the circle are all brighter than the intensity of the
candidate pixel p plus a threshold value t, or all darker than the intensity
of p minus t, then p is classified as a corner.

In the MPPA-256 implementation, we use randomly generated images as inputs and
a mask containing the positions relative to p that must be analyzed. N is set
to 12 and the threshold t is set to 20. Because the input images are very
large and the compute clusters have limited memory, the I/O subsystem
partitions the input image into 256 KB chunks and sends them to the compute
clusters for corner detection. We consider this a medium granularity, one that
provokes intense communication and makes the kernel NoC-bound. The output
chunks are then sent back to the I/O subsystem, which puts them together to
build the output image indicating all the corners present in it. Moreover, the
I/O subsystem receives the number of corners detected by each compute cluster
and sums these counts to report the overall number of detected corners. These
per-cluster counts may differ, pointing out the irregularity of the algorithm.
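As an illustration of the segment test, here is a minimal, sequential sketch
in C. The circle offsets and function names are illustrative rather than taken
from the CAP Bench sources, and the chunking and cluster communication of the
MPPA-256 kernel are omitted.

    /* Minimal sketch of the FAST corner test (names illustrative).
     * Assumes an 8-bit grayscale image stored row-major; the caller must
     * keep (x, y) at least 3 pixels away from the image border. */
    #include <stdint.h>

    #define CIRCLE 16 /* pixels on the circle around p         */
    #define N      12 /* contiguous pixels required (as above) */
    #define T      20 /* brightness threshold t (as above)     */

    /* Relative offsets of the 16-pixel circle of radius 3, clockwise. */
    static const int dx[CIRCLE] = { 0, 1, 2, 3, 3, 3, 2, 1, 0,-1,-2,-3,-3,-3,-2,-1};
    static const int dy[CIRCLE] = {-3,-3,-2,-1, 0, 1, 2, 3, 3, 3, 2, 1, 0,-1,-2,-3};

    /* Returns 1 if the pixel at (x, y) is a corner. */
    int is_corner(const uint8_t *img, int width, int x, int y)
    {
        int p = img[y * width + x];
        int run_bright = 0, run_dark = 0;

        /* Walk the circle twice so that runs wrapping past pixel 16
         * are still counted as contiguous. */
        for (int i = 0; i < 2 * CIRCLE; i++) {
            int q = img[(y + dy[i % CIRCLE]) * width + (x + dx[i % CIRCLE])];

            run_bright = (q > p + T) ? run_bright + 1 : 0;
            run_dark   = (q < p - T) ? run_dark   + 1 : 0;

            if (run_bright >= N || run_dark >= N)
                return 1;
        }
        return 0;
    }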
2) Friendly Numbers - FN

In number theory, two natural numbers are friendly if they share the same
abundance. For a given number n, the abundance is defined as the ratio between
the sum of the divisors of n and n itself. The parallel pattern used in the FN
implementation is MapReduce (a minimal sketch is given after the IS kernel
below).

The MPPA-256 implementation uses a master/slave approach. Since every
processing task in FN can be executed independently, the master process splits
the number interval into sets of equal size and distributes them among the
compute clusters to be processed simultaneously. Because these sets are
balanced, the task load on the slaves is regular. The abundance results are
sent back to the master process, which then performs the abundance comparisons
using the 4 I/O clusters. This computation is not dominated by NoC usage or
memory accesses; it is exclusively CPU-bound.

3) Gaussian Filter - GF

The Gaussian blur filter is an image smoothing filter that seeks to reduce
noise and achieve an overall smoothing of the image. It consists in applying a
specially computed two-dimensional Gaussian mask to an image through a matrix
convolution operation, and it follows the Stencil parallel pattern (also
sketched after the IS kernel below).

In the MPPA-256 implementation, we use randomly generated masks and images as
inputs. Since some input images are very large and the compute clusters have a
2 MB memory restriction, the I/O subsystem partitions the image into 1 MB
chunks and sends them to the compute clusters to be filtered. This is a
reasonable chunk size, inducing moderate NoC usage that does not overwhelm the
overall computation. Besides that, the task load is always constant. After the
individual chunks are filtered, they are sent back to the I/O subsystem, which
puts them together to build the output image.

4) Integer Sort - IS

The integer sort problem consists in sorting a very large amount of integers.
An implementation of integer sort can use the bucket sort approach, which
divides the elements into buckets; a bucket is a structure that stores numbers
in a certain range. The integers used as input are randomly generated and
range from 0 to 2^20 - 1. The IS kernel uses the Divide and Conquer parallel
pattern.

In the MPPA-256 implementation, the buckets are further subdivided into
minibuckets of 1 MB. As input elements are mapped to the appropriate buckets,
they are placed in a minibucket. When the minibucket becomes full, a new
minibucket is allocated inside its parent bucket and starts receiving
elements. This takes place in the I/O subsystem, which also sends the
minibuckets to the compute clusters to work on. Each compute cluster receives
only one minibucket at a time, due to memory restrictions. Inside a compute
cluster, minibuckets are sorted using a parallel mergesort algorithm; since
the starting order is random, the task load is irregular. Sorted minibuckets
are then sent back to the I/O subsystem to be merged, using Pthreads to take
advantage of its 4 cores. Because of this flow of minibuckets, IS is a
communication-intensive algorithm, bounded by the NoC.
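A minimal sketch of the FN abundance test described above, assuming a
trial-division sum of divisors; the abundance sigma(n)/n is kept as a fraction
and compared by cross-multiplication to avoid floating point. Names are
illustrative.

    #include <stdio.h>

    /* Sum of all divisors of n, including 1 and n itself. */
    static long sum_divisors(long n)
    {
        long sum = 0;

        for (long d = 1; d * d <= n; d++) {
            if (n % d == 0) {
                sum += d;
                if (d != n / d)       /* count the paired divisor once */
                    sum += n / d;
            }
        }
        return sum;
    }

    /* Two numbers are friendly when sigma(a)/a == sigma(b)/b. */
    static int friendly(long a, long b)
    {
        return sum_divisors(a) * b == sum_divisors(b) * a;
    }

    int main(void)
    {
        /* 6 and 28 are both perfect (abundance 2), hence friendly. */
        printf("%d\n", friendly(6, 28)); /* prints 1 */
        return 0;
    }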
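For GF, a minimal sketch of the convolution applied to each chunk; mask
generation and the exchange of chunks with the I/O subsystem are omitted, the
mask is assumed normalized (entries summing to 1), and all names are
illustrative.

    #include <stdint.h>

    /* Applies a (2r+1) x (2r+1) mask to the interior pixels of a
     * row-major 8-bit grayscale image. */
    void gaussian_filter(const uint8_t *in, uint8_t *out,
                         int width, int height,
                         const double *mask, int r)
    {
        int side = 2 * r + 1;

        for (int y = r; y < height - r; y++) {
            for (int x = r; x < width - r; x++) {
                double acc = 0.0;

                /* Weighted sum of the pixel's neighborhood. */
                for (int i = -r; i <= r; i++)
                    for (int j = -r; j <= r; j++)
                        acc += mask[(i + r) * side + (j + r)] *
                               in[(y + i) * width + (x + j)];

                out[y * width + x] = (uint8_t)acc;
            }
        }
    }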
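For IS, a minimal sequential sketch of the bucket scheme; minibucket chaining,
cluster offloading, and the parallel mergesort are replaced by a plain
per-bucket qsort, and the constants and names are illustrative.

    #include <stdlib.h>

    #define RANGE    (1 << 20) /* inputs fall in [0, 2^20 - 1], as above */
    #define NBUCKETS 256

    static int cmp_int(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    /* Sorts n integers in place using NBUCKETS equal-width buckets. */
    void bucket_sort(int *v, int n)
    {
        int *bucket[NBUCKETS], count[NBUCKETS] = {0};
        int width = RANGE / NBUCKETS;

        for (int i = 0; i < n; i++)          /* count, then allocate */
            count[v[i] / width]++;
        for (int b = 0; b < NBUCKETS; b++)
            bucket[b] = malloc(count[b] * sizeof(int));

        int fill[NBUCKETS] = {0};
        for (int i = 0; i < n; i++) {        /* scatter into buckets */
            int b = v[i] / width;
            bucket[b][fill[b]++] = v[i];
        }

        /* Buckets cover increasing ranges, so sorting each one and
         * concatenating them yields the sorted array. */
        int k = 0;
        for (int b = 0; b < NBUCKETS; b++) {
            qsort(bucket[b], count[b], sizeof(int), cmp_int);
            for (int i = 0; i < count[b]; i++)
                v[k++] = bucket[b][i];
            free(bucket[b]);
        }
    }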
5) K-Means - KM

K-Means clustering is a data clustering solution employed in clustering
analysis; we opted to use Lloyd's algorithm in our work. Given a set of n
points in a real d-dimensional space, the problem is to partition these n
points into k partitions. The data points are evenly and randomly distributed
among the k partitions, and the initial centroids are computed. Then, the data
points are re-clustered into partitions taking into account the minimum
Euclidean distance between them and the centroids. Next, the centroid of each
partition is recalculated as the mean of all points in the partition. The
whole procedure is repeated until no centroid moves by more than a minimum
accepted distance. During the execution, the number of points within each
partition may differ, implying different recalculation times for each
partition's centroid. The parallel pattern of KM is Map.

The MPPA-256 version of the K-Means algorithm takes additional parameters: p,
specifying the number of compute clusters to be used, and t, specifying the
number of execution flows per cluster. Each cluster spawns t worker threads,
so the total number of threads equals p * t. This strategy provokes intense
communication between the I/O subsystem and the compute clusters. We first
distribute the data points and replicate the centroids among the clusters, and
then loop over a two-phase iteration (see the sketch after the LU kernel
below). First, partitions are populated. Then, the centroids are recalculated,
a memory-intensive process. For this recalculation, each cluster uses its
local data points to compute partial centroids, i.e., a partial sum of the
data points and the population within a partition. Next, clusters exchange
partial centroids so that each cluster ends up with the partial centroids of
the same partitions. The amount of work for each thread may vary during each
iteration, an irregular task load characteristic. Finally, clusters compute
their local centroids and send them to the master process.

6) LU Factorization - LU

The LU algorithm is a matrix decomposition algorithm that factors a matrix A
(always square) as the product of two triangular matrices: a lower one (L) and
an upper one (U). To compute the L and U matrices -- we opted to implement
Gaussian elimination -- n-1 iterations are required, where n is the order of
A. In each iteration, a sub-matrix of smaller order (one unit smaller) is
analyzed, starting from order n. First, the biggest element of the sub-matrix
(the pivot) is found. Then, this element is moved to the (1,1) position by
shifting rows and columns appropriately. The line containing the pivot is
subsequently divided by it, so that the pivot element becomes 1. The last step
(reduction) aims to nullify the elements below the main diagonal: for every
line l below the pivot, we multiply the pivot line p by the opposite of the
first element e of l and replace l with l + (-e)*p, storing -e in a separate
matrix. After the iterations are finished, every line has undergone reduction,
and the resulting matrix is the U matrix. The L matrix is formed by the -e
factors stored in the separate matrix; therefore, both matrices are computed
simultaneously. The LU kernel uses the Workpool parallel pattern (the
reduction step is sketched below).

We implemented the LU factorization algorithm for the MPPA-256 by assigning
rows to compute clusters so that the largest element can be found. Each
compute cluster receives no more than 1 MB of data, which causes intense
communication. On the other hand, the distributed task loads are regular. The
same 1 MB restriction applies when distributing lines among the compute
clusters to apply the reduction. Swapping rows so that the pivot becomes the
first element of the matrix is done exclusively in the master process (I/O
subsystem), which is also used to rebuild the L and U matrices from the chunks
processed in the compute clusters.
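A minimal sequential sketch of the two-phase Lloyd iteration behind KM; the
distribution of points across clusters and the exchange of partial centroids
are omitted, and the constants, names, and convergence test are illustrative.

    #include <float.h>
    #include <math.h>
    #include <string.h>

    #define D 2           /* dimensions                          */
    #define K 4           /* partitions                          */
    #define MIN_DIST 1e-6 /* minimum accepted centroid movement  */

    static double sq_dist(const double *a, const double *b)
    {
        double s = 0.0;
        for (int j = 0; j < D; j++)
            s += (a[j] - b[j]) * (a[j] - b[j]);
        return s;
    }

    /* One call = one full run; map[] receives each point's partition. */
    void kmeans(const double (*pt)[D], int n, double (*centroid)[D], int *map)
    {
        double moved;

        do {
            /* Phase 1: re-cluster each point to its nearest centroid. */
            for (int i = 0; i < n; i++) {
                double best = DBL_MAX;
                for (int k = 0; k < K; k++)
                    if (sq_dist(pt[i], centroid[k]) < best) {
                        best = sq_dist(pt[i], centroid[k]);
                        map[i] = k;
                    }
            }

            /* Phase 2: recompute each centroid as the mean of its
             * points (partial sums and populations, as in the text). */
            double sum[K][D] = {{0}};
            int pop[K] = {0};
            for (int i = 0; i < n; i++) {
                pop[map[i]]++;
                for (int j = 0; j < D; j++)
                    sum[map[i]][j] += pt[i][j];
            }

            moved = 0.0;
            for (int k = 0; k < K; k++)
                if (pop[k] > 0) {
                    double next[D];
                    for (int j = 0; j < D; j++)
                        next[j] = sum[k][j] / pop[k];
                    moved = fmax(moved, sqrt(sq_dist(next, centroid[k])));
                    memcpy(centroid[k], next, sizeof(next));
                }
        } while (moved > MIN_DIST); /* stop when centroids settle */
    }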
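For LU, a minimal sketch of the reduction step in its standard in-place form;
the pivot search and row/column swapping described above are omitted for
brevity, and names are illustrative. After the call, the upper triangle of a
holds U and the factors below the diagonal form L (with an implicit unit
diagonal).

    /* In-place LU reduction of an n x n row-major matrix a. */
    void lu_decompose(double *a, int n)
    {
        for (int k = 0; k < n - 1; k++) {      /* n-1 iterations */
            for (int i = k + 1; i < n; i++) {
                /* Factor that nullifies a[i][k]; it is stored where
                 * the zero goes, which builds L alongside U. */
                double e = a[i * n + k] / a[k * n + k];
                a[i * n + k] = e;

                /* l <- l - e * p for the rest of row i. */
                for (int j = k + 1; j < n; j++)
                    a[i * n + j] -= e * a[k * n + j];
            }
        }
    }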
7) Traveling-Salesman Problem - TSP

The TSP consists in solving the routing problem of a hypothetical traveling
salesman: the route must pass through n towns, only once per town, return to
the town of origin, and have the shortest possible length. We opted for a
brute-force approach based on the branch-and-bound method. It takes as input
the number of towns and a cost matrix, and outputs the minimum path length.
The algorithm performs a depth-first search looking for the shortest path,
pruning branches whose cost exceeds the current minimum. This pruning
introduces irregularity into the algorithm, since the search depth needed to
discard a branch depends on the order in which the branches are searched. Our
implementation of TSP uses the Workpool parallel pattern.

The MPPA-256 implementation uses a task queue in which the tasks are branches
of the search tree; compute clusters take jobs from it to execute. The number
of clusters and the number of threads per cluster define the total number of
execution flows: each cluster spawns n_threads threads, totaling
n_threads * n_clusters threads. When the minimum path is updated, the update
is broadcast to every other cluster so they can also use it to prune their
searches. At the end of the execution, one of the clusters (typically cluster
0) prints the solution. The final solution may have been discovered by any of
the clusters, but all of them are aware of it thanks to the broadcasts of each
discovered minimum path.
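Finally, a minimal sequential sketch of the branch-and-bound search behind
TSP; the task queue, worker clusters, and minimum-path broadcasts are omitted,
and the constants and names are illustrative.

    #include <limits.h>

    #define NTOWNS 10

    static int cost[NTOWNS][NTOWNS]; /* input cost matrix           */
    static int best = INT_MAX;       /* current minimum path length */

    /* Depth-first search from `town`, having visited `depth` towns at
     * total cost `len`; `visited` is a bitmask of visited towns.
     * Branches costlier than the current minimum are pruned. */
    static void search(int town, int depth, int len, int visited)
    {
        if (len >= best)               /* prune this branch */
            return;

        if (depth == NTOWNS) {         /* close the tour at town 0 */
            int total = len + cost[town][0];
            if (total < best)
                best = total;
            return;
        }

        for (int next = 1; next < NTOWNS; next++)
            if (!(visited & (1 << next)))
                search(next, depth + 1, len + cost[town][next],
                       visited | (1 << next));
    }

    /* Usage: fill cost[][], call search(0, 1, 0, 1); `best` then holds
     * the minimum path length. */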