123 changes: 117 additions & 6 deletions README.md
@@ -3,12 +3,123 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Saksham Nagpal
* [LinkedIn](https://www.linkedin.com/in/nagpalsaksham/)
* Tested on: Windows 11 Home, AMD Ryzen 7 6800H Radeon @ 3.2GHz 16GB, NVIDIA GeForce RTX 3050 Ti Laptop GPU 4096MB

### (TODO: Your README)
Introduction
====
This project implements the **Scan (All-Prefix Sums)** and **Stream Compaction** algorithms. We first implement the algorithms on the CPU, then implement them in CUDA to run on the GPU. A performance analysis comparing the different approaches follows.

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
### Scan
The Scan algorithm takes an array and, for each index, applies an operator (here, addition) to all preceding elements, producing the array of prefix sums. We implement this algorithm in 3 different ways:
1. <b>CPU Scan:</b> A CPU-side version of the algorithm that walks the array sequentially, accumulating a running sum that is written to the next index.
2. <b>Naive Scan:</b> A naive implementation that translates the CPU-side algorithm to the GPU, accumulating the result by dividing the array into smaller sub-arrays and adding their corresponding elements in parallel passes.
3. <b>Work-Efficient Scan:</b> An optimization of the naive scan that treats the input array as a balanced binary tree and performs an _up-sweep_ and a _down-sweep_ to accumulate the result (a minimal kernel sketch follows this list).
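
To make the up-sweep/down-sweep traversal concrete, here is a minimal sketch of the two kernels of a Blelloch-style work-efficient scan. The kernel names, the `stride` parameter, and the launch scheme are illustrative assumptions and are not taken from this repository's `efficient.cu` (which is not part of this diff); the sketch assumes the device array has been padded to a power-of-two length.

```
// Illustrative sketch only; assumed names and signatures, not the repo's efficient.cu.
// `stride` equals 2^d at tree depth d, and n is the (power-of-two) padded length.
__global__ void kernUpSweep(int n, int stride, int* data) {
    int index = threadIdx.x + (blockIdx.x * blockDim.x);
    int k = index * 2 * stride;                   // one thread per active tree node
    if (k + 2 * stride - 1 >= n) {
        return;
    }
    // Each parent accumulates the partial sum held by its left child.
    data[k + 2 * stride - 1] += data[k + stride - 1];
}

__global__ void kernDownSweep(int n, int stride, int* data) {
    int index = threadIdx.x + (blockIdx.x * blockDim.x);
    int k = index * 2 * stride;
    if (k + 2 * stride - 1 >= n) {
        return;
    }
    int left = data[k + stride - 1];                  // save the left child
    data[k + stride - 1] = data[k + 2 * stride - 1];  // left child takes the parent's value
    data[k + 2 * stride - 1] += left;                 // right child = parent + old left child
}
```

On the host, the up-sweep is launched with `stride` doubling from 1 to n/2, the root element `data[n - 1]` is then zeroed, and the down-sweep is launched with `stride` halving from n/2 back to 1, leaving an exclusive scan in place.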


### Stream Compaction
Stream Compaction filters an array against a given condition, removing the elements that do not meet it and thus 'compacting' the data stream. In this project, we compact arrays of integers, filtering out every element that is 0. We implement this algorithm in 3 different ways:
1. A simple CPU version,
2. A CPU version that imitates the parallelized version using _scan_ and _scatter_ passes, and
3. A GPU version that maps the input to booleans, runs _scan_ on the mapped boolean array, and then runs _scatter_ to produce the compacted output (see the host-side sketch after this list).
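
As a rough illustration of how the three GPU passes fit together, the following host-side sketch wires the `kernMapToBoolean` and `kernScatter` kernels (added in `common.cu` below) into a compaction routine. The function name `compactSketch`, the host-side exclusive scan, and the `StreamCompaction::Common` namespace qualification are assumptions made for brevity; the project's actual `StreamCompaction::Efficient::compact` (not shown in this diff) would typically keep the scan on the GPU as well.

```
#include <vector>
#include <cuda_runtime.h>
#include "common.h"   // assumed to declare the kernels under StreamCompaction::Common

// Illustrative sketch only; not the repository's Efficient::compact.
int compactSketch(int n, int* odata, const int* idata, int blockSize = 128) {
    int *dev_idata, *dev_odata, *dev_bools, *dev_indices;
    cudaMalloc(&dev_idata,   n * sizeof(int));
    cudaMalloc(&dev_odata,   n * sizeof(int));
    cudaMalloc(&dev_bools,   n * sizeof(int));
    cudaMalloc(&dev_indices, n * sizeof(int));
    cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);

    dim3 blocks((n + blockSize - 1) / blockSize);

    // 1. Map: 1 where an element is kept (non-zero), 0 where it is discarded.
    StreamCompaction::Common::kernMapToBoolean<<<blocks, blockSize>>>(n, dev_bools, dev_idata);

    // 2. Exclusive scan of the flags gives each kept element its output slot.
    //    Done on the host here purely for brevity.
    std::vector<int> bools(n), indices(n);
    cudaMemcpy(bools.data(), dev_bools, n * sizeof(int), cudaMemcpyDeviceToHost);
    indices[0] = 0;
    for (int i = 1; i < n; i++) {
        indices[i] = indices[i - 1] + bools[i - 1];
    }
    cudaMemcpy(dev_indices, indices.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // 3. Scatter: each kept element is written to its scanned index.
    StreamCompaction::Common::kernScatter<<<blocks, blockSize>>>(n, dev_odata, dev_idata, dev_bools, dev_indices);

    // The exclusive scan misses the last element, so add its flag back in.
    int count = indices[n - 1] + bools[n - 1];
    cudaMemcpy(odata, dev_odata, count * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_idata);
    cudaFree(dev_odata);
    cudaFree(dev_bools);
    cudaFree(dev_indices);
    return count;
}
```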


Performance Analysis
====

* ## Figuring out the appropriate block size
First, we measure the performance of our different implementations across varying block sizes to determine the most suitable block size before continuing the comparison.

| ![](img/time_vs_blocksize_2_21.png) |
|:--:|
| *Time (ms) VS Block Size using an array of 2<sup>21</sup> elements* |

We see a significant performance gain as the block size increases up to 64; beyond that, the improvement is negligible. For further performance comparisons, we use a block size of 128.
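
This block size matches the `BLOCK_SIZE` constant added to `common.h` in this change, and the grid size for each kernel launch is derived from it in the usual way. The snippet below is a generic illustration of that launch sizing rather than the exact host code of any one implementation:

```
// Round the grid up so that n elements are covered by blocks of BLOCK_SIZE threads.
dim3 fullBlocksPerGrid((n + BLOCK_SIZE - 1) / BLOCK_SIZE);
kernMapToBoolean<<<fullBlocksPerGrid, BLOCK_SIZE>>>(n, dev_bools, dev_idata);
```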

* ## Comparing Scan implementations

| ![](img/scan_time_vs_array_size_blocksize128.png) |
|:--:|
| ***Scan:** Time (ms) VS Array Size using Block Size of 128* |

Next, we compare how our different implementations perform as the array size varies. Thrust's implementation is clearly the fastest, while our naive implementation is the slowest. Interestingly, the sequential CPU implementation outperforms both the Naive and the Work-Efficient GPU-based methods for the most part, though the Work-Efficient implementation catches up to it for larger arrays.

* ## Comparing Stream Compaction implementations

| ![](img/compact_time_vs_array_size_blocksize128.png) |
|:--:|
| ***Stream Compaction:** Time (ms) VS Array Size using Block Size of 128* |

Lastly, we compare the performance of our 3 different implementations of stream compaction. The GPU-based work-efficient compaction significantly outperforms the other two methods.

#### Can you find the performance bottlenecks? Is it memory I/O? Computation? Is it different for each implementation?
As we can see, the GPU-based implementations (Naive and Work-Efficient) of the Scan algorithm are not that 'efficient': even the CPU-based approach outperforms them. Some observations that can be made in this regard are as follows:
1. One possible cause of this bottleneck is **warp partitioning**. Our implementations divide the array into sub-arrays and add their corresponding elements, and the way we index threads can leave half of the threads in each warp stalled, causing warp divergence. The indexing can potentially be tweaked so that after each division step we end up with completely idle warps, which can then be retired so the scheduler can issue other warps.
2. As an optimization to the Work-Efficient method, we launch only as many threads as there are elements in the divided sub-array at each iteration (a sketch of this per-level launch follows the list). This is likely why this approach catches up to the CPU implementation's performance in the second graph above, since the parallel implementation offsets the sequential computation, especially for larger arrays.
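
The per-level launch described in point 2 could look roughly like the hedged sketch below, reusing the `kernUpSweep` sketch from the Scan section above; the helper name and loop structure are illustrative assumptions, not the repository's exact code.

```
// Illustrative only: shrink the grid every level so that exactly one thread is
// launched per active tree node. Thread indices stay contiguous, so whole warps
// finish together and can be retired instead of idling half-divergent.
void upSweepLaunchSketch(int n, int* dev_data, int blockSize) {
    for (int stride = 1; stride < n; stride *= 2) {
        int activeThreads = n / (2 * stride);   // nodes at this depth (n is a power of two)
        dim3 blocks((activeThreads + blockSize - 1) / blockSize);
        kernUpSweep<<<blocks, blockSize>>>(n, stride, dev_data);
    }
}
```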

Output
====
The following tests were run with an array size of **2<sup>21</sup>**, a non-power-of-two array size of **2<sup>21</sup> - 3**, and a block size of **128**.
```
****************
** SCAN TESTS **
****************
[ 38 48 2 7 4 39 42 0 33 1 4 46 46 ... 37 0 ]
==== cpu scan, power-of-two ====
elapsed time: 1.2294ms (std::chrono Measured)
[ 0 38 86 88 95 99 138 180 180 213 214 218 264 ... 51329733 51329770 ]
==== cpu scan, non-power-of-two ====
elapsed time: 1.2644ms (std::chrono Measured)
[ 0 38 86 88 95 99 138 180 180 213 214 218 264 ... 51329672 51329705 ]
passed
==== naive scan, power-of-two ====
elapsed time: 2.62829ms (CUDA Measured)
[ 0 38 86 88 95 99 138 180 180 213 214 218 264 ... 51329733 51329770 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 3.84182ms (CUDA Measured)
[ 0 38 86 88 95 99 138 180 180 213 214 218 264 ... 51329672 51329705 ]
passed
==== work-efficient scan, power-of-two ====
elapsed time: 1.43606ms (CUDA Measured)
[ 0 38 86 88 95 99 138 180 180 213 214 218 264 ... 51329733 51329770 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 1.27795ms (CUDA Measured)
[ 0 38 86 88 95 99 138 180 180 213 214 218 264 ... 51329672 51329705 ]
passed
==== thrust scan, power-of-two ====
elapsed time: 0.882496ms (CUDA Measured)
[ 0 38 86 88 95 99 138 180 180 213 214 218 264 ... 51329733 51329770 ]
passed
==== thrust scan, non-power-of-two ====
elapsed time: 1.51619ms (CUDA Measured)
[ 0 38 86 88 95 99 138 180 180 213 214 218 264 ... 51329672 51329705 ]
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 2 2 3 2 3 0 0 3 1 0 2 2 ... 1 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 4.7782ms (std::chrono Measured)
[ 2 2 2 3 2 3 3 1 2 2 3 2 1 ... 3 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 4.5961ms (std::chrono Measured)
[ 2 2 2 3 2 3 3 1 2 2 3 2 1 ... 1 3 ]
passed
==== cpu compact with scan ====
elapsed time: 9.5317ms (std::chrono Measured)
[ 2 2 2 3 2 3 3 1 2 2 3 2 1 ... 3 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 2.03965ms (CUDA Measured)
[ 2 2 2 3 2 3 3 1 2 2 3 2 1 ... 3 1 ]
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 1.76128ms (CUDA Measured)
[ 2 2 2 3 2 3 3 1 2 2 3 2 1 ... 1 3 ]
passed
```
Binary file added img/compact_time_vs_array_size_blocksize128.png
Binary file added img/scan_time_vs_array_size_blocksize128.png
Binary file added img/time_vs_blocksize_2_21.png
55 changes: 45 additions & 10 deletions src/main.cpp
@@ -13,16 +13,51 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 21; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];

int main1(int argc, char* argv[]) {
    // Block-size sweep used for the "Time vs Block Size" measurements:
    // each candidate block size is averaged over 100 randomized runs.
    int bs = 16;
    while (bs <= 1024)
    {
        float cpuScan = 0.0;
        float naiveScan = 0.0;
        float workEffScan = 0.0;
        float thrustScan = 0.0;
        for (int i = 0; i < 100; i++) {
            genArray(SIZE - 5, a, 50);
            a[SIZE - 1] = 0;

            zeroArray(SIZE, c);
            //printDesc("naive scan, non-power-of-two");
            StreamCompaction::Naive::scan(NPOT, c, a, bs);
            naiveScan += StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation();

            zeroArray(SIZE, c);
            //printDesc("work-efficient scan, non-power-of-two");
            StreamCompaction::Efficient::scan(NPOT, c, a, bs);
            workEffScan += StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation();
        }
        std::cout << " BLOCK SIZE: " << bs << std::endl;
        std::cout << " Naive Scan: " << naiveScan / 100.f << std::endl;
        std::cout << " Work Efficient Scan: " << workEffScan / 100.f << std::endl;
        bs *= 2;
    }
    system("pause"); // stop Win32 console from closing on exit
    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}

int main(int argc, char* argv[]) {
// Scan tests

printf("\n");
std::cout << " BLOCK SIZE: " << BLOCK_SIZE << std::endl;
printf("****************\n");
printf("** SCAN TESTS **\n");
printf("****************\n");
@@ -51,7 +86,7 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
@@ -64,35 +99,35 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
@@ -137,14 +172,14 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
14 changes: 12 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,11 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
    int index = threadIdx.x + (blockIdx.x * blockDim.x);
    if (index >= n) {
        return;
    }
    bools[index] = idata[index] == 0 ? 0 : 1;
}

/**
@@ -32,7 +36,13 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
    int index = threadIdx.x + (blockIdx.x * blockDim.x);
    if (index >= n) {
        return;
    }
    if (bools[index] == 1) {
        odata[indices[index]] = idata[index];
    }
}

}
2 changes: 2 additions & 0 deletions stream_compaction/common.h
@@ -12,6 +12,8 @@

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
#define BLOCK_SIZE 128


/**
* Check for CUDA errors; print and exit if there was a problem.
51 changes: 44 additions & 7 deletions stream_compaction/cpu.cu
@@ -18,8 +18,12 @@ namespace StreamCompaction {
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
    timer().startCpuTimer();
    // Exclusive scan: each output element is the sum of all preceding inputs.
    odata[0] = 0;
    for (int i = 1; i < n; i++) {
        odata[i] = odata[i - 1] + idata[i - 1];
    }
timer().endCpuTimer();
}

@@ -30,9 +34,15 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
    int index = 0;
    // Copy non-zero elements forward, counting them as we go.
    for (int i = 0; i < n; i++) {
        if (idata[i] != 0) {
            odata[index] = idata[i];
            index++;
        }
    }
timer().endCpuTimer();
return -1;
return index;
}

/**
@@ -41,10 +51,37 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
    int* temp = new int[n];
    int* scan = new int[n];
    timer().startCpuTimer();

    // Map: flag each element we want to keep.
    for (int i = 0; i < n; i++) {
        temp[i] = (idata[i] == 0) ? 0 : 1;
    }

    // Scan: exclusive prefix sum of the flags gives each kept element its output index.
    scan[0] = 0;
    for (int i = 1; i < n; i++) {
        scan[i] = scan[i - 1] + temp[i - 1];
    }

    // Scatter: write kept elements to their computed positions.
    for (int i = 0; i < n; i++) {
        if (temp[i] == 1) {
            odata[scan[i]] = idata[i];
        }
    }

    timer().endCpuTimer();
return -1;
    int noOfRemaining = scan[n - 1] + temp[n - 1]; // the exclusive scan does not count the last element if it is kept
delete[] temp;
delete[] scan;
return noOfRemaining;
}
}
}