97 changes: 91 additions & 6 deletions README.md
@@ -3,12 +3,97 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Yian Chen
* [LinkedIn](https://www.linkedin.com/in/yian-chen-33a31a1a8/), [personal website](https://sydianandrewchen.github.io/) etc.
* Tested on: Windows 10, AMD Ryzen 5800 HS with Radeon Graphics CPU @ 3.20GHz 16GB, NVIDIA GeForce RTX3060 Laptop 8GB

### (TODO: Your README)
### Questions

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)

* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
(with array size on the independent axis).
![](img/scan-power-of-two.png)
![](img/scan-non-power-of-two.png)
* To guess at what might be happening inside the Thrust implementation (e.g.
allocation, memory copy), take a look at the Nsight timeline for its
execution. Your analysis here doesn't have to be detailed, since you aren't
even looking at the code for the implementation.
- Timeline of `thrust::exclusive_scan`:
![](img/image.png)
From the timeline we can see three scan kernels that are launched together and finish together.
My guess is that `thrust::exclusive_scan` implements an algorithm that combines up-sweep, down-sweep, and shifting passes.
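The guessed structure can be sketched on the host. The following C++ sketch is an illustration of that guess, not Thrust's actual implementation: a Blelloch-style exclusive scan in three phases — an up-sweep reduction, zeroing the root, and a down-sweep.

```cpp
#include <cassert>
#include <vector>

// Blelloch exclusive scan, in place. Assumes data.size() is a power of two.
void blellochScan(std::vector<int>& data) {
    int n = (int)data.size();
    // Up-sweep: build a reduction tree of partial sums.
    for (int offset = 1; offset < n; offset <<= 1)
        for (int i = offset * 2 - 1; i < n; i += offset * 2)
            data[i] += data[i - offset];
    // Zero the root before descending.
    data[n - 1] = 0;
    // Down-sweep: push partial sums back down the tree.
    for (int offset = n / 2; offset >= 1; offset >>= 1)
        for (int i = offset * 2 - 1; i < n; i += offset * 2) {
            int t = data[i - offset];
            data[i - offset] = data[i];  // left child takes parent's value
            data[i] += t;                // right child takes parent + left sum
        }
}
```

Each inner loop here corresponds to one kernel launch per tree level on the GPU, which would appear as a run of short back-to-back kernels on the Nsight timeline.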

* Write a brief explanation of the phenomena you see here.
- As the data size grows exponentially, the CPU scan and the GPU naive scan slow down the fastest.
- At large data sizes, the CPU scan can even beat the GPU naive scan. I suspect this is due to the CPU scan's full exploitation of locality (sequential, cache-friendly access), whereas the GPU naive scan issues a large number of global memory accesses.
- The GPU work-efficient scan and the `thrust` scan slow down much more gradually, with `thrust` performing best.
* Can you find the performance bottlenecks? Is it memory I/O? Computation? Is it different for each implementation?
The bottlenecks differ with the design of each implementation:
    - For the CPU method, the bottleneck is the lack of parallelism compared with the GPU methods.
    - For the naive scan, the bottlenecks are:
        - Double buffering. On the plus side, this trick saves a large amount of memory copying between passes; however, ping-ponging between two large buffers means every thread reads and writes global memory on every pass.
        - Time complexity: the naive scan performs O(n log n) additions, versus O(n) for the work-efficient scan.
    - For the work-efficient scan:
        - Redundant thread launches (now settled by shrinking the grid at each level).
        - No use of shared memory, so every tree level still goes through global memory.
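The double-buffering cost can be made concrete with a host-side sketch (a CPU simulation, not the project's CUDA kernel) of the naive Hillis–Steele scan: each of the log2(n) passes reads and writes every element of the two ping-pong buffers, which on the GPU is all global-memory traffic.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hillis-Steele (naive) scan simulated on the host with two ping-pong buffers.
std::vector<int> naiveExclusiveScan(const std::vector<int>& in) {
    int n = (int)in.size();
    if (n == 0) return {};
    std::vector<int> bufA(n, 0), bufB(n, 0);
    // Shift the input right by one so the inclusive passes yield an exclusive scan.
    std::copy(in.begin(), in.end() - 1, bufA.begin() + 1);
    std::vector<int>* src = &bufA;
    std::vector<int>* dst = &bufB;
    for (int offset = 1; offset < n; offset <<= 1) {   // one pass per level
        for (int i = 0; i < n; ++i)                    // one "thread" per element
            (*dst)[i] = (i >= offset) ? (*src)[i] + (*src)[i - offset]
                                      : (*src)[i];
        std::swap(src, dst);                           // ping-pong the buffers
    }
    return *src;
}
```

Counting the work also explains the complexity gap: every pass touches all n elements and there are log2(n) passes, so the naive scan performs on the order of n·log2(n) additions, while the work-efficient scan's shrinking trees total roughly 2n.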


* Paste the output of the test program into a triple-backtick block in your
README.
- Output of the scan (`SIZE = 1 << 26`)
```
****************
** SCAN TESTS **
****************
[ 6 12 22 4 21 5 23 32 32 36 46 37 34 ... 47 0 ]
==== cpu scan, power-of-two ====
elapsed time: 48.9518ms (std::chrono Measured)
[ 0 6 18 40 44 65 70 93 125 157 193 239 276 ... 1643792958 1643793005 ]
==== cpu scan, non-power-of-two ====
elapsed time: 45.263ms (std::chrono Measured)
[ 0 6 18 40 44 65 70 93 125 157 193 239 276 ... 1643792920 1643792939 ]
passed
==== naive scan, power-of-two ====
elapsed time: 63.2586ms (CUDA Measured)
[ 0 6 18 40 44 65 70 93 125 157 193 239 276 ... 1643792958 1643793005 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 62.9893ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 24.2463ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 22.6304ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 2.46502ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 2.53338ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 0 1 1 3 0 1 0 1 0 0 3 1 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 120.001ms (std::chrono Measured)
[ 1 1 1 3 1 1 3 1 3 3 2 3 3 ... 3 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 123.921ms (std::chrono Measured)
[ 1 1 1 3 1 1 3 1 3 3 2 3 3 ... 3 3 ]
passed
==== cpu compact with scan ====
elapsed time: 219.142ms (std::chrono Measured)
[ 1 1 1 3 1 1 3 1 3 3 2 3 3 ... 3 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 22.6714ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 22.6632ms (CUDA Measured)
passed
```
Binary file added img/image.png
Binary file added img/scan-non-power-of-two.png
Binary file added img/scan-power-of-two.png
17 changes: 14 additions & 3 deletions src/main.cpp
@@ -13,13 +13,13 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 26; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];

int main(int argc, char* argv[]) {
void scanTest() {
// Scan tests

printf("\n");
@@ -51,7 +51,7 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
@@ -94,6 +94,9 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);
}

void compactionTest() {

printf("\n");
printf("*****************************\n");
@@ -146,8 +149,16 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);
}

void test() {
scanTest();
compactionTest();
system("pause"); // stop Win32 console from closing on exit
}

int main(int argc, char* argv[]) {
test();
delete[] a;
delete[] b;
delete[] c;
10 changes: 10 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,10 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = (blockDim.x * blockIdx.x) + threadIdx.x;
if (index < n) {
bools[index] = bool(idata[index]);
}
}

/**
@@ -33,6 +37,12 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = (blockDim.x * blockIdx.x) + threadIdx.x;
if (index < n) {
if (bools[index]) {
odata[indices[index]] = idata[index];
}
}
}

}
2 changes: 1 addition & 1 deletion stream_compaction/common.h
@@ -12,7 +12,7 @@

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

#define blockSize 512
/**
* Check for CUDA errors; print and exit if there was a problem.
*/
33 changes: 30 additions & 3 deletions stream_compaction/cpu.cu
@@ -12,6 +12,17 @@ namespace StreamCompaction {
return timer;
}

void scanCore(int n, int* odata, const int* idata) {
// Exclusive prefix sum: odata[i] = idata[0] + ... + idata[i - 1].
if (n <= 0) return;
odata[0] = 0;
for (int i = 1; i < n; ++i) {
odata[i] = odata[i - 1] + idata[i - 1];
}
}

/**
* CPU scan (prefix sum).
* For performance analysis, this is supposed to be a simple for loop.
@@ -20,6 +31,7 @@
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
scanCore(n, odata, idata);
timer().endCpuTimer();
}

@@ -31,8 +43,12 @@
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int oPtr = 0;
for (int i = 0; i < n; ++i) {
if (idata[i]) odata[oPtr++] = idata[i];
}
timer().endCpuTimer();
return -1;
return oPtr;
}

/**
@@ -41,10 +57,21 @@
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
int* odata_tmp = new int[n];
timer().startCpuTimer();
// TODO

for (int i = 0; i < n; ++i) {
odata_tmp[i] = (idata[i] != 0);            // map to a 0/1 flag
if (i) odata_tmp[i] += odata_tmp[i - 1];   // inclusive scan of the flags
if (idata[i]) {
odata[odata_tmp[i] - 1] = idata[i];        // scatter to the compacted slot
}
}
int oSize = odata_tmp[n-1];
timer().endCpuTimer();
return -1;
delete [] odata_tmp;

return oSize;
}
}
}
120 changes: 117 additions & 3 deletions stream_compaction/efficient.cu
@@ -15,10 +15,75 @@ namespace StreamCompaction {
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/

void scanCore(int n, int* dev_odata) {
// Up-sweep (reduce): only n / (offset * 2) threads do work at each
// depth, so the grid shrinks as the sweep climbs the tree.
int offset = 1;
for (int d = 0; d < ilog2(n); d++) {
int operation_number = n / (offset * 2);
dim3 blocksPerGrid((operation_number + (blockSize - 1)) / blockSize);
if (blocksPerGrid.x == 1) {
kernUpSweep << <1, operation_number >> > (n, offset, dev_odata);
}
else {
kernUpSweep << <blocksPerGrid, blockSize >> > (n, offset, dev_odata);
}
checkCUDAError("kernUpSweep failed");
offset <<= 1;
}

// Down sweep: walk back down the tree, distributing partial sums.
for (int d = ilog2(n) - 1; d >= 0; d--) {
offset = (1 << d);
int operation_number = n / (offset * 2);
dim3 blocksPerGrid((operation_number + (blockSize - 1)) / blockSize);
if (blocksPerGrid.x == 1) {
kernDownSweep << <1, operation_number >> > (n, offset, dev_odata);
}
else {
kernDownSweep << <blocksPerGrid, blockSize >> > (n, offset, dev_odata);
}
checkCUDAError("kernDownSweep failed");
}
}
void scan(int n, int *odata, const int *idata) {
int padded_n = (1 << ilog2ceil(n));
int* dev_odata;
cudaMalloc(&dev_odata, padded_n * sizeof(int));
cudaMemcpy(dev_odata, idata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyHostToDevice);
timer().startGpuTimer();
// TODO
scanCore(padded_n, dev_odata);
timer().endGpuTimer();
cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost);
cudaFree(dev_odata);
}

__global__ void kernUpSweep(int n, int offset, int* odata1) {
int index = (blockDim.x * blockIdx.x) + threadIdx.x;
int arrIndex = index * (offset * 2);
if (arrIndex < n) {
odata1[arrIndex + offset * 2 - 1] += odata1[arrIndex + offset - 1];
// Zero the root for the coming down-sweep. Only the single thread
// that owns the last element writes it; an unguarded write here
// would race with the addition above.
if (arrIndex + offset * 2 == n) {
odata1[n - 1] = 0;
}
}
}

__global__ void kernDownSweep(int n, int offset, int* odata1) {
int index = (blockDim.x * blockIdx.x) + threadIdx.x;
int arrIndex = index * (offset * 2);
if (arrIndex < n) {
// Swap the left child into the right, accumulating partial sums.
int t = odata1[arrIndex + offset - 1];
odata1[arrIndex + offset - 1] = odata1[arrIndex + offset * 2 - 1];
odata1[arrIndex + offset * 2 - 1] += t;
}
}

/**
@@ -31,10 +96,59 @@
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
timer().startGpuTimer();
// TODO
/*
bools, indices should only be allocated on device
odata and idata needs to be copied to device
*/
int padded_n = (1 << ilog2ceil(n));

int* dev_bools;
/* The padded tail [n, padded_n) of dev_bools is left uninitialized; each
exclusive-scan entry below n depends only on earlier elements, so the
garbage never reaches the result. Zeroing the tail with cudaMemset
would still be the more defensive choice. */
cudaMalloc(&dev_bools, padded_n * sizeof(int));
checkCUDAError("cudaMalloc dev_bools failed");

int* dev_idata;
cudaMalloc(&dev_idata, n * sizeof(int));
checkCUDAError("cudaMalloc dev_idata failed");

cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyHostToDevice);
checkCUDAError("cudaMemcpy from idata to dev_idata failed");

int* dev_indices;
cudaMalloc(&dev_indices, padded_n * sizeof(int));
checkCUDAError("cudaMalloc dev_indices failed");

int* dev_odata;
cudaMalloc(&dev_odata, n * sizeof(int));
checkCUDAError("cudaMalloc dev_odata failed");

dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_bools, dev_idata);
checkCUDAError("kernMapToBoolean failed");

cudaMemcpy(dev_indices, dev_bools, padded_n * sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToDevice);

timer().startGpuTimer();
scanCore(padded_n, dev_indices);
timer().endGpuTimer();
return -1;
StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata, dev_idata, dev_bools, dev_indices);
checkCUDAError("kernScatter failed");

// Compacted length = exclusive-scan value at the last slot plus that slot's flag.
int length, last_element;
cudaMemcpy(&length, dev_indices + n - 1, sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost);
cudaMemcpy(&last_element, dev_bools + n - 1, sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost);
checkCUDAError("cudaMemcpy of scan tail to host failed");
length += last_element;
cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost);

cudaFree(dev_idata);
cudaFree(dev_odata);
cudaFree(dev_indices);
cudaFree(dev_bools);

return length;
}
}
}