97 changes: 91 additions & 6 deletions README.md
@@ -3,12 +3,97 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Yian Chen
* [LinkedIn](https://www.linkedin.com/in/yian-chen-33a31a1a8/), [personal website](https://sydianandrewchen.github.io/) etc.
* Tested on: Windows 10, AMD Ryzen 5800 HS with Radeon Graphics CPU @ 3.20GHz 16GB, NVIDIA GeForce RTX3060 Laptop 8GB

### (TODO: Your README)
### Questions

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)

* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
(with array size on the independent axis).
![](img/scan-power-of-two.png)
![](img/scan-non-power-of-two.png)
* To guess at what might be happening inside the Thrust implementation (e.g.
allocation, memory copy), take a look at the Nsight timeline for its
execution. Your analysis here doesn't have to be detailed, since you aren't
even looking at the code for the implementation.
- Timeline of `thrust::exclusive_scan`:
![](img/image.png)
From the timeline we can see three scan kernels that are launched together and finish together.
My guess is that `thrust::exclusive_scan` implements an algorithm that combines up-sweep, down-sweep, and shifting passes.
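The guessed structure can be sketched on the host. The following C++ sketch is an illustration of that guess, not Thrust's actual implementation: a Blelloch-style exclusive scan in three phases — an up-sweep reduction, zeroing the root, and a down-sweep.

```cpp
#include <cassert>
#include <vector>

// Blelloch exclusive scan, in place. Assumes data.size() is a power of two.
void blellochScan(std::vector<int>& data) {
    int n = (int)data.size();
    // Up-sweep: build a reduction tree of partial sums.
    for (int offset = 1; offset < n; offset <<= 1)
        for (int i = offset * 2 - 1; i < n; i += offset * 2)
            data[i] += data[i - offset];
    // Zero the root before descending.
    data[n - 1] = 0;
    // Down-sweep: push partial sums back down the tree.
    for (int offset = n / 2; offset >= 1; offset >>= 1)
        for (int i = offset * 2 - 1; i < n; i += offset * 2) {
            int t = data[i - offset];
            data[i - offset] = data[i];  // left child takes parent's value
            data[i] += t;                // right child takes parent + left sum
        }
}
```

Each inner loop here corresponds to one kernel launch per tree level on the GPU, which would appear as a run of short back-to-back kernels on the Nsight timeline.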

* Write a brief explanation of the phenomena you see here.
- As the data size grows exponentially, the CPU scan and the GPU naive scan slow down the fastest.
- At large data sizes, the CPU scan can even beat the GPU naive scan. I suspect this is due to the CPU scan's full exploitation of locality (sequential, cache-friendly access), whereas the GPU naive scan issues a large number of global memory accesses.
- The GPU work-efficient scan and the `thrust` scan slow down much more gradually, with `thrust` performing best.
* Can you find the performance bottlenecks? Is it memory I/O? Computation? Is it different for each implementation?
The bottlenecks differ with the design of each implementation:
    - For the CPU method, the bottleneck is the lack of parallelism compared with the GPU methods.
    - For the naive scan, the bottlenecks are:
        - Double buffering. On the plus side, this trick saves a large amount of memory copying between passes; however, ping-ponging between two large buffers means every thread reads and writes global memory on every pass.
        - Time complexity: the naive scan performs O(n log n) additions, versus O(n) for the work-efficient scan.
    - For the work-efficient scan:
        - Redundant thread launches (now settled by shrinking the grid at each level).
        - No use of shared memory, so every tree level still goes through global memory.
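The double-buffering cost can be made concrete with a host-side sketch (a CPU simulation, not the project's CUDA kernel) of the naive Hillis–Steele scan: each of the log2(n) passes reads and writes every element of the two ping-pong buffers, which on the GPU is all global-memory traffic.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hillis-Steele (naive) scan simulated on the host with two ping-pong buffers.
std::vector<int> naiveExclusiveScan(const std::vector<int>& in) {
    int n = (int)in.size();
    if (n == 0) return {};
    std::vector<int> bufA(n, 0), bufB(n, 0);
    // Shift the input right by one so the inclusive passes yield an exclusive scan.
    std::copy(in.begin(), in.end() - 1, bufA.begin() + 1);
    std::vector<int>* src = &bufA;
    std::vector<int>* dst = &bufB;
    for (int offset = 1; offset < n; offset <<= 1) {   // one pass per level
        for (int i = 0; i < n; ++i)                    // one "thread" per element
            (*dst)[i] = (i >= offset) ? (*src)[i] + (*src)[i - offset]
                                      : (*src)[i];
        std::swap(src, dst);                           // ping-pong the buffers
    }
    return *src;
}
```

Counting the work also explains the complexity gap: every pass touches all n elements and there are log2(n) passes, so the naive scan performs on the order of n·log2(n) additions, while the work-efficient scan's shrinking trees total roughly 2n.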


* Paste the output of the test program into a triple-backtick block in your
README.
- Output of the scan (`SIZE = 1 << 26`)
```
****************
** SCAN TESTS **
****************
[ 6 12 22 4 21 5 23 32 32 36 46 37 34 ... 47 0 ]
==== cpu scan, power-of-two ====
elapsed time: 48.9518ms (std::chrono Measured)
[ 0 6 18 40 44 65 70 93 125 157 193 239 276 ... 1643792958 1643793005 ]
==== cpu scan, non-power-of-two ====
elapsed time: 45.263ms (std::chrono Measured)
[ 0 6 18 40 44 65 70 93 125 157 193 239 276 ... 1643792920 1643792939 ]
passed
==== naive scan, power-of-two ====
elapsed time: 63.2586ms (CUDA Measured)
[ 0 6 18 40 44 65 70 93 125 157 193 239 276 ... 1643792958 1643793005 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 62.9893ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 24.2463ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 22.6304ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 2.46502ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 2.53338ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 0 1 1 3 0 1 0 1 0 0 3 1 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 120.001ms (std::chrono Measured)
[ 1 1 1 3 1 1 3 1 3 3 2 3 3 ... 3 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 123.921ms (std::chrono Measured)
[ 1 1 1 3 1 1 3 1 3 3 2 3 3 ... 3 3 ]
passed
==== cpu compact with scan ====
elapsed time: 219.142ms (std::chrono Measured)
[ 1 1 1 3 1 1 3 1 3 3 2 3 3 ... 3 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 22.6714ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 22.6632ms (CUDA Measured)
passed
```
Binary file added img/image.png
Binary file added img/scan-non-power-of-two.png
Binary file added img/scan-power-of-two.png
17 changes: 14 additions & 3 deletions src/main.cpp
@@ -13,13 +13,13 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 26; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];

int main(int argc, char* argv[]) {
void scanTest() {
// Scan tests

printf("\n");
@@ -51,7 +51,7 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
@@ -94,6 +94,9 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);
}

void compactionTest() {

printf("\n");
printf("*****************************\n");
@@ -146,8 +149,16 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);
}

void test() {
scanTest();
compactionTest();
system("pause"); // stop Win32 console from closing on exit
}

int main(int argc, char* argv[]) {
test();
delete[] a;
delete[] b;
delete[] c;
10 changes: 10 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,10 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = (blockDim.x * blockIdx.x) + threadIdx.x;
if (index < n) {
bools[index] = bool(idata[index]);
}
}

/**
@@ -33,6 +37,12 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = (blockDim.x * blockIdx.x) + threadIdx.x;
if (index < n) {
if (bools[index]) {
odata[indices[index]] = idata[index];
}
}
}

}
2 changes: 1 addition & 1 deletion stream_compaction/common.h
@@ -12,7 +12,7 @@

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

#define blockSize 512
/**
* Check for CUDA errors; print and exit if there was a problem.
*/
33 changes: 30 additions & 3 deletions stream_compaction/cpu.cu
@@ -12,6 +12,17 @@ namespace StreamCompaction {
return timer;
}

void scanCore(int n, int* odata, const int* idata) {
// Exclusive prefix sum: odata[i] = idata[0] + ... + idata[i - 1].
if (n <= 0) return;
odata[0] = 0;
for (int i = 1; i < n; ++i) {
odata[i] = odata[i - 1] + idata[i - 1];
}
}

/**
* CPU scan (prefix sum).
* For performance analysis, this is supposed to be a simple for loop.
@@ -20,6 +31,7 @@
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
scanCore(n, odata, idata);
timer().endCpuTimer();
}

@@ -31,8 +43,12 @@
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int oPtr = 0;
for (int i = 0; i < n; ++i) {
if (idata[i]) odata[oPtr++] = idata[i];
}
timer().endCpuTimer();
return -1;
return oPtr;
}

/**
@@ -41,10 +57,21 @@
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
int* odata_tmp = new int[n];
timer().startCpuTimer();
// TODO

for (int i = 0; i < n; ++i) {
odata_tmp[i] = (idata[i] != 0);            // map to a 0/1 flag
if (i) odata_tmp[i] += odata_tmp[i - 1];   // inclusive scan of the flags
if (idata[i]) {
odata[odata_tmp[i] - 1] = idata[i];        // scatter to the compacted slot
}
}
int oSize = odata_tmp[n-1];
timer().endCpuTimer();
return -1;
delete [] odata_tmp;

return oSize;
}
}
}
120 changes: 117 additions & 3 deletions stream_compaction/efficient.cu
@@ -15,10 +15,75 @@ namespace StreamCompaction {
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/

void scanCore(int n, int* dev_odata) {
// Up-sweep (reduce): only n / (offset * 2) threads do work at each
// depth, so the grid shrinks as the sweep climbs the tree.
int offset = 1;
for (int d = 0; d < ilog2(n); d++) {
int operation_number = n / (offset * 2);
dim3 blocksPerGrid((operation_number + (blockSize - 1)) / blockSize);
if (blocksPerGrid.x == 1) {
kernUpSweep << <1, operation_number >> > (n, offset, dev_odata);
}
else {
kernUpSweep << <blocksPerGrid, blockSize >> > (n, offset, dev_odata);
}
checkCUDAError("kernUpSweep failed");
offset <<= 1;
}

// Down sweep: walk back down the tree, distributing partial sums.
for (int d = ilog2(n) - 1; d >= 0; d--) {
offset = (1 << d);
int operation_number = n / (offset * 2);
dim3 blocksPerGrid((operation_number + (blockSize - 1)) / blockSize);
if (blocksPerGrid.x == 1) {
kernDownSweep << <1, operation_number >> > (n, offset, dev_odata);
}
else {
kernDownSweep << <blocksPerGrid, blockSize >> > (n, offset, dev_odata);
}
checkCUDAError("kernDownSweep failed");
}
}
void scan(int n, int *odata, const int *idata) {
int padded_n = (1 << ilog2ceil(n));
int* dev_odata;
cudaMalloc(&dev_odata, padded_n * sizeof(int));
cudaMemcpy(dev_odata, idata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyHostToDevice);
timer().startGpuTimer();
// TODO
scanCore(padded_n, dev_odata);
timer().endGpuTimer();
cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost);
cudaFree(dev_odata);
}

__global__ void kernUpSweep(int n, int offset, int* odata1) {
int index = (blockDim.x * blockIdx.x) + threadIdx.x;
int arrIndex = index * (offset * 2);
if (arrIndex < n) {
odata1[arrIndex + offset * 2 - 1] += odata1[arrIndex + offset - 1];
// Zero the root for the coming down-sweep. Only the single thread
// that owns the last element writes it; an unguarded write here
// would race with the addition above.
if (arrIndex + offset * 2 == n) {
odata1[n - 1] = 0;
}
}
}

__global__ void kernDownSweep(int n, int offset, int* odata1) {
int index = (blockDim.x * blockIdx.x) + threadIdx.x;
int arrIndex = index * (offset * 2);
if (arrIndex < n) {
// Swap the left child into the right, accumulating partial sums.
int t = odata1[arrIndex + offset - 1];
odata1[arrIndex + offset - 1] = odata1[arrIndex + offset * 2 - 1];
odata1[arrIndex + offset * 2 - 1] += t;
}
}

/**
@@ -31,10 +96,59 @@
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
timer().startGpuTimer();
// TODO
/*
bools, indices should only be allocated on device
odata and idata needs to be copied to device
*/
int padded_n = (1 << ilog2ceil(n));

int* dev_bools;
/* The padded tail [n, padded_n) of dev_bools is left uninitialized; each
exclusive-scan entry below n depends only on earlier elements, so the
garbage never reaches the result. Zeroing the tail with cudaMemset
would still be the more defensive choice. */
cudaMalloc(&dev_bools, padded_n * sizeof(int));
checkCUDAError("cudaMalloc dev_bools failed");

int* dev_idata;
cudaMalloc(&dev_idata, n * sizeof(int));
checkCUDAError("cudaMalloc dev_idata failed");

cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyHostToDevice);
checkCUDAError("cudaMemcpy from idata to dev_idata failed");

int* dev_indices;
cudaMalloc(&dev_indices, padded_n * sizeof(int));
checkCUDAError("cudaMalloc dev_indices failed");

int* dev_odata;
cudaMalloc(&dev_odata, n * sizeof(int));
checkCUDAError("cudaMalloc dev_odata failed");

dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_bools, dev_idata);
checkCUDAError("kernMapToBoolean failed");

cudaMemcpy(dev_indices, dev_bools, padded_n * sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToDevice);

timer().startGpuTimer();
scanCore(padded_n, dev_indices);
timer().endGpuTimer();
return -1;
StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata, dev_idata, dev_bools, dev_indices);
checkCUDAError("kernScatter failed");

// Compacted length = exclusive-scan value at the last slot plus that slot's flag.
int length, last_element;
cudaMemcpy(&length, dev_indices + n - 1, sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost);
cudaMemcpy(&last_element, dev_bools + n - 1, sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost);
checkCUDAError("cudaMemcpy of scan tail to host failed");
length += last_element;
cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost);

cudaFree(dev_idata);
cudaFree(dev_odata);
cudaFree(dev_indices);
cudaFree(dev_bools);

return length;
}
}
}