diff --git a/README.md b/README.md
index 0e38ddb..d5d9b62 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,113 @@ CUDA Stream Compaction

 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+Jason Li ([LinkedIn](https://linkedin.com/in/jeylii))
+* Tested on: Windows 10, Ryzen 5 3600X @ 3.80GHz, 32 GB, NVIDIA RTX 4070

-### (TODO: Your README)
+## **Implementation and Performance Analysis**

-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+Four different implementations of exclusive scan are considered:
+- A CPU implementation, which simply loops through the elements of the input array and sums them;
+- A naive GPU implementation, which performs O(n log n) addition operations in parallel;
+- A work-efficient GPU implementation, which performs O(n) addition operations in parallel;
+- The NVIDIA Thrust library's implementation.
+
+Furthermore, we consider three different implementations of stream compaction:
+- A CPU implementation that does not use the CPU implementation of exclusive scan;
+- A CPU implementation that uses the CPU implementation of exclusive scan;
+- A work-efficient GPU implementation, which uses our work-efficient implementation of exclusive scan.
+
+For extra credit, our work-efficient GPU implementation is sufficiently faster than our naive and CPU implementations, especially at larger array sizes.
+
+### **Performance for Increasing Array Sizes**
+Below we analyze the performance differences between these implementations.
+
+First, we roughly optimize the block size for each of our GPU implementations by sweeping a range of block sizes, running the test program at each one, and comparing runtimes across the range.
+
+![naive scan runtime vs block size](img/naive_scan_runtime_vs_block_size.png)
+![efficient scan runtime vs block size](img/efficient_scan_runtime_vs_block_size.png)
+
+As seen in the first graph above, for our naive implementation of exclusive scan, the fastest runtime was observed with a block size of **320**, which we use in all further comparisons.
+As seen in the second graph above, for both our work-efficient exclusive scan and our stream compaction, the fastest runtimes were observed with a block size of **512**, which we use in all further comparisons.
+
+![exclusive scan runtime](img/exclusive_scan_runtime.png)
+
+From this graph, we can see that while the CPU implementation is much faster for small array sizes (on the order of 100K elements or fewer), the GPU implementations quickly overtake it and are an order of magnitude faster at large array sizes.
+The CPU implementation is likely much faster at small array sizes because of overhead common to all GPU implementations, such as memory copies to and from the GPU, thread scheduling, and kernel launches from the CPU.
+However, once the array sizes become large, the parallelism inherent to the GPU implementations quickly makes them much faster than the CPU implementation.
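+
+As a rough illustration of this fixed overhead, the sketch below (a hypothetical helper, not part of the test program) uses CUDA events to time a single host-to-device copy plus an empty kernel launch; at small array sizes these fixed costs can rival the scan work itself.
+
+```cpp
+#include <cstdio>
+#include <cuda_runtime.h>
+
+__global__ void kernEmpty() {}
+
+// Hypothetical helper: measure the fixed cost of one copy and one launch.
+void measureOverhead(int n) {
+    int *host = new int[n]();
+    int *dev;
+    cudaMalloc(&dev, n * sizeof(int));
+
+    cudaEvent_t start, stop;
+    cudaEventCreate(&start);
+    cudaEventCreate(&stop);
+
+    cudaEventRecord(start);
+    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);
+    kernEmpty<<<1, 1>>>();
+    cudaEventRecord(stop);
+    cudaEventSynchronize(stop);
+
+    float ms = 0.0f;
+    cudaEventElapsedTime(&ms, start, stop);
+    std::printf("copy + empty launch for n = %d: %f ms\n", n, ms);
+
+    cudaEventDestroy(start);
+    cudaEventDestroy(stop);
+    cudaFree(dev);
+    delete[] host;
+}
+```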
+
+![gpu exclusive scan runtime](img/gpu_exclusive_scan_runtime.png)
+
+When we compare among the GPU implementations, as in the graph above, we see that for small array sizes (on the order of 1M or fewer) the differences between the GPU scan implementations are negligible, and the work-efficient scan is in fact generally the slowest. Once the array sizes become large, however, the differences become clear: the naive implementation is the slowest, followed by the work-efficient scan, followed by the Thrust library implementation, which is the fastest by an order of magnitude.
+
+This is expected, as the naive implementation contains several inefficiencies: it must ping-pong between buffers to avoid a race condition, it leaves many threads idle, and it performs more addition operations than necessary. The work-efficient scan eliminates these inefficiencies but incurs additional overhead from memory operations and kernel launches, which is likely why it is slower than the naive implementation at small array sizes. The Thrust implementation is fastest by a large margin - for the largest array size tested, 2^30, it was about 15-16 times faster than our work-efficient scan. This is likely due to its use of shared memory, as opposed to the global memory used by our GPU implementations.
+
+![compaction runtime](img/compaction_runtime.png)
+
+When comparing runtimes of the stream compaction implementations, we see that, as in our scan analysis, the CPU implementations are much faster for small array sizes, while the work-efficient GPU implementation becomes faster once array sizes become large. This is expected given both the overhead (such as memory operations) and the parallelism that come with the GPU implementation.
+
+Some improvements that could be made to our work-efficient GPU implementations of exclusive scan and stream compaction are changing the order in which we access indices during our passes and removing memory operations. Index access could be reordered so that the active threads in each pass are packed into fewer warps, which reduces scheduling work as more warps terminate early. In addition, we currently read GPU memory to return the final array size during stream compaction, and we write GPU memory to zero the last array element between the up-sweep and down-sweep steps of the scan; these costly memory operations could be optimized away to improve runtime.
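+
+For the index-reordering idea, one possible variant (a sketch against our kernels as written, not what was submitted) launches only as many threads as are active in each up-sweep pass, so the surviving threads pack into the lowest-numbered warps:
+
+```cpp
+// Sketch: shrink the grid each pass instead of always launching
+// full_data_size threads. kern_scan_up_sweep already maps thread t to
+// index t << (d + 1), so threads 0..active_threads-1 cover exactly the
+// nodes touched in pass d; only the last block's padding threads hit
+// the bounds check and exit.
+for (int d = 0; d < n_ceil_log_2; d++)
+{
+    int active_threads = full_data_size >> (d + 1);
+    dim3 blocks_per_pass((active_threads + block_size - 1) / block_size);
+    kern_scan_up_sweep<<<blocks_per_pass, block_size>>>(d, full_data_size, dev_odata);
+}
+```
+
+The down-sweep loop admits the same per-pass shrinking, run in reverse.
+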
+### Profiling of Thrust Implementation
+
+![thrust profiling](img/thrust_profiling.png)
+
+Looking at the Nsight timeline for the Thrust implementation's execution on an array of size 2^25, we can see that for the two calls to the Thrust scan shown in the timeline, over 90% of the time is spent in memory operations. Here, the two memory operations each run for more than 20 ms, while the DeviceScanKernel itself runs for only about 63 us. While the Thrust library's computation is extremely well optimized, the function is ultimately bottlenecked by I/O operations such as cudaMemcpy. Our other GPU implementations are likely similarly bottlenecked: they perform many costly global memory accesses (we do not use shared memory) and multiple kernel launches, so computation is likely not the bottleneck there either.
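+
+Since the copies dominate, Thrust's optimized kernel mostly pays off when data can stay resident on the GPU across operations. Hypothetical usage (assuming the caller already owns a device-side buffer, unlike our test harness):
+
+```cpp
+#include <thrust/device_vector.h>
+#include <thrust/scan.h>
+
+// Hypothetical: d_data already lives on the GPU, so this exclusive scan
+// involves no host<->device traffic at all.
+void scanInPlace(thrust::device_vector<int> &d_data) {
+    thrust::exclusive_scan(d_data.begin(), d_data.end(), d_data.begin());
+}
+```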
+
+## Test Program Output
+This is the test program's output for an array size of 2^27, with no additional tests added.
+```
+****************
+** SCAN TESTS **
+****************
+    [  33  20  46   0   6   2  18  12  18  44  29  43  34 ...   9   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 261.214ms    (std::chrono Measured)
+    [   0  33  53  99  99 105 107 125 137 155 199 228 271 ... -1007576195 -1007576186 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 257.151ms    (std::chrono Measured)
+    [   0  33  53  99  99 105 107 125 137 155 199 228 271 ... -1007576245 -1007576214 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 73.2ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 72.0187ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 41.301ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 41.2993ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 3.54906ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 3.20854ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   3   0   3   0   1   1   2   2   0   0   1   2   0 ...   1   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 248.794ms    (std::chrono Measured)
+    [   3   3   1   1   2   2   1   2   1   2   3   3   3 ...   2   1 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 249.189ms    (std::chrono Measured)
+    [   3   3   1   1   2   2   1   2   1   2   3   3   3 ...   1   1 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 690.428ms    (std::chrono Measured)
+    [   3   3   1   1   2   2   1   2   1   2   3   3   3 ...   2   1 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 69.2437ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 51.8314ms    (CUDA Measured)
+    passed
+```
diff --git a/img/compaction_runtime.png b/img/compaction_runtime.png
new file mode 100644
index 0000000..b2eb345
Binary files /dev/null and b/img/compaction_runtime.png differ
diff --git a/img/efficient_scan_runtime_vs_block_size.png b/img/efficient_scan_runtime_vs_block_size.png
new file mode 100644
index 0000000..1c453f0
Binary files /dev/null and b/img/efficient_scan_runtime_vs_block_size.png differ
diff --git a/img/exclusive_scan_runtime.png b/img/exclusive_scan_runtime.png
new file mode 100644
index 0000000..4a502d2
Binary files /dev/null and b/img/exclusive_scan_runtime.png differ
diff --git a/img/gpu_exclusive_scan_runtime.png b/img/gpu_exclusive_scan_runtime.png
new file mode 100644
index 0000000..a978cea
Binary files /dev/null and b/img/gpu_exclusive_scan_runtime.png differ
diff --git a/img/naive_scan_runtime_vs_block_size.png b/img/naive_scan_runtime_vs_block_size.png
new file mode 100644
index 0000000..a0d144b
Binary files /dev/null and b/img/naive_scan_runtime_vs_block_size.png differ
diff --git a/img/thrust_profiling.png b/img/thrust_profiling.png
new file mode 100644
index 0000000..5600351
Binary files /dev/null and b/img/thrust_profiling.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..e7ecfd7 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"

-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 27; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d63..9bb4a2d 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -23,7 +23,12 @@ namespace StreamCompaction {
          * which map to 0 will be removed, and elements which map to 1 will be kept.
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-            // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n)
+            {
+                return;
+            }
+            bools[index] = (idata[index] == 0) ? 0 : 1;
         }

         /**
@@ -32,7 +37,15 @@ namespace StreamCompaction {
          */
         __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) {
-            // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n)
+            {
+                return;
+            }
+            if (bools[index] == 1)
+            {
+                odata[indices[index]] = idata[index];
+            }
         }
     }
 }
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..94bb4d4 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -12,6 +12,15 @@ namespace StreamCompaction {
             return timer;
         }

+        void scan_impl(int n, int* odata, const int* idata)
+        {
+            odata[0] = 0;
+            for (int i = 1; i < n; i++)
+            {
+                odata[i] = idata[i - 1] + odata[i - 1];
+            }
+        }
+
         /**
          * CPU scan (prefix sum).
          * For performance analysis, this is supposed to be a simple for loop.
@@ -19,7 +28,7 @@ namespace StreamCompaction {
          */
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            scan_impl(n, odata, idata);
             timer().endCpuTimer();
         }

@@ -30,9 +39,19 @@ namespace StreamCompaction {
          */
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            int output_index = 0;
+            int curr_data;
+            for (int i = 0; i < n; i++)
+            {
+                curr_data = idata[i];
+                if (curr_data != 0)
+                {
+                    odata[output_index] = curr_data;
+                    output_index++;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return output_index;
         }

@@ -42,9 +61,25 @@ namespace StreamCompaction {
          */
         int compactWithScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            int *temp = new int[n];
+            for (int i = 0; i < n; i++)
+            {
+                temp[i] = idata[i] != 0 ? 1 : 0;
+            }
+            int* scan_result = new int[n];
+            scan_impl(n, scan_result, temp);
+
+            int final_index = -1; // stays -1 if every element is culled
+            for (int i = 0; i < n; i++)
+            {
+                if (temp[i] == 1)
+                {
+                    final_index = scan_result[i];
+                    odata[final_index] = idata[i];
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            delete[] temp;
+            delete[] scan_result;
+            return final_index + 1;
         }
     }
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..ac22fdb 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -3,22 +3,96 @@
 #include "common.h"
 #include "efficient.h"

+#define block_size 512
+
 namespace StreamCompaction {
     namespace Efficient {
         using StreamCompaction::Common::PerformanceTimer;
+        using StreamCompaction::Common::kernMapToBoolean;
+        using StreamCompaction::Common::kernScatter;
+
         PerformanceTimer& timer()
         {
             static PerformanceTimer timer;
             return timer;
         }

+        __global__ void kern_scan_up_sweep(int d, int n, int *data)
+        {
+            /* Memory access errors occur when using int or long on large array sizes.
+               This is most likely due to overflow when converting threadIdx to 1D
+               and further multiplying by 2^d, which is done to iterate by 2^(d+1). */
+            unsigned long long index = threadIdx.x + (blockIdx.x * blockDim.x);
+            index <<= (d + 1);
+            if (index >= n)
+            {
+                return;
+            }
+            data[index + (1 << (d + 1)) - 1] += data[index + (1 << d) - 1];
+        }
+
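+        // Down-sweep: each node hands its value to its left child and stores
+        // (old left child + its value) in its right child, converting the
+        // up-sweep's partial-sum tree into an exclusive scan.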
1 : 0; + } + int* scan_result = new int[n]; + scan_impl(n, scan_result, temp); + + int final_index = 0; + for (int i = 0; i < n; i++) + { + if (temp[i] == 1) + { + final_index = scan_result[i]; + odata[final_index] = idata[i]; + } + } timer().endCpuTimer(); - return -1; + return final_index + 1; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346e..ac22fdb 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -3,22 +3,96 @@ #include "common.h" #include "efficient.h" +#define block_size 512 + namespace StreamCompaction { namespace Efficient { using StreamCompaction::Common::PerformanceTimer; + using StreamCompaction::Common::kernMapToBoolean; + using StreamCompaction::Common::kernScatter; + PerformanceTimer& timer() { static PerformanceTimer timer; return timer; } + __global__ void kern_scan_up_sweep(int d, int n, int *data) + { + /* Memory access errors occur when using int or long on large array sizes + This is most likely due to overflow when converting threadIdx to 1D + And further multiplying by 2^d, this is done to iterate by 2^(d+1) */ + unsigned long long index = threadIdx.x + (blockIdx.x * blockDim.x); + index <<= (d + 1); + if (index >= n) + { + return; + } + data[index + (1 << (d + 1)) - 1] += data[index + (1 << d) - 1]; + } + + __global__ void kern_scan_down_sweep(int d, int n, int* data) + { + unsigned long long index = threadIdx.x + (blockIdx.x * blockDim.x); + index <<= (d + 1); + if (index >= n) + { + return; + } + int temp = data[index + (1 << d) - 1]; + data[index + (1 << d) - 1] = data[index + (1 << (d + 1)) - 1]; + data[index + (1 << (d + 1)) - 1] += temp; + } + + __global__ void kern_add_padding_zeros(int n, int full_data_size, int* data) + { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= full_data_size || index < n) + { + return; + } + data[index] = 0; + } + + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + + int* dev_odata; + + int n_ceil_log_2 = ilog2ceil(n); + int full_data_size = (1 << n_ceil_log_2); + + dim3 full_blocks_per_grid((full_data_size + block_size - 1) / block_size); + + cudaMalloc((void**)&dev_odata, full_data_size * sizeof(int)); + checkCUDAErrorFn("cudaMalloc dev_odata failed!"); + + cudaMemcpy(dev_odata, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAErrorFn("cudaMemcpy dev_odata failed!"); + timer().startGpuTimer(); - // TODO + kern_add_padding_zeros <<>> (n, full_data_size, dev_odata); + for (int i = 0; i < n_ceil_log_2; i++) + { + kern_scan_up_sweep <<>> (i, full_data_size, dev_odata); + } + cudaMemset(&dev_odata[full_data_size - 1], 0, sizeof(int)); + checkCUDAErrorFn("cudaMemset odata last ele failed!"); + + for (int i = n_ceil_log_2 - 1; i >= 0; i--) + { + kern_scan_down_sweep <<>> (i, full_data_size, dev_odata); + } timer().endGpuTimer(); + + cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAErrorFn("cudaMemcpy odata failed!"); + + cudaFree(dev_odata); + checkCUDAErrorFn("cudaFree dev_odata failed!"); } /** @@ -31,10 +105,71 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. 
+        __global__ void kern_scan_global(int d, int n, const int* idata, int* odata)
+        {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n)
+            {
+                return;
+            }
+            int offset = (1 << (d - 1));
+            if (index >= offset)
+            {
+                odata[index] = idata[index - offset] + idata[index];
+            }
+            else
+            {
+                odata[index] = idata[index];
+            }
+        }
+
+        __global__ void kern_inclusive_to_exclusive(int n, const int* idata, int* odata)
+        {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n)
+            {
+                return;
+            }
+            if (index == 0)
+            {
+                odata[index] = 0;
+            }
+            else
+            {
+                odata[index] = idata[index - 1];
+            }
+        }
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
+            int *dev_idata;
+            int* dev_odata;
+            dim3 full_blocks_per_grid((n + block_size - 1) / block_size);
+
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc dev_idata failed!");
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc dev_odata failed!");
+
+            cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            checkCUDAErrorFn("cudaMemcpy dev_idata failed!");
+
             timer().startGpuTimer();
-            // TODO
+
+            for (int i = 0; i < ilog2ceil(n); i++)
+            {
+                kern_scan_global<<<full_blocks_per_grid, block_size>>>(i + 1, n, dev_idata, dev_odata);
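+                // Ping-pong: this pass's output becomes the next pass's input.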
+                std::swap(dev_idata, dev_odata);
+            }
+            kern_inclusive_to_exclusive<<<full_blocks_per_grid, block_size>>>(n, dev_idata, dev_odata);
+
             timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost);
+            checkCUDAErrorFn("cudaMemcpy dev_odata failed!");
+
+            cudaFree(dev_idata);
+            checkCUDAErrorFn("cudaFree dev_idata failed!");
+            cudaFree(dev_odata);
+            checkCUDAErrorFn("cudaFree dev_odata failed!");
+        }
+
     }
 }
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..81d4fa9 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -18,11 +18,12 @@ namespace StreamCompaction {
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
+            thrust::host_vector<int> h_idata(idata, idata + n);
+            thrust::device_vector<int> d_idata = h_idata;
             timer().startGpuTimer();
-            // TODO use `thrust::exclusive_scan`
-            // example: for device_vectors dv_in and dv_out:
-            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(d_idata.begin(), d_idata.end(), d_idata.begin());
             timer().endGpuTimer();
+            thrust::copy(d_idata.begin(), d_idata.end(), odata);
         }
     }
 }