CIS565-Fall-2016 · trungtle · Sep 3, 2016 · Sep 3, 2016 · Sep 14, 2016 · Sep 14, 2016
diff --git a/README.md b/README.md
@@ -1,10 +1,79 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Trung Le
+* Windows 10 Home, i7-4790 CPU @ 3.60GHz 12GB, GTX 980 Ti (Personal desktop)
 
-### (TODO: Your README)
+### Flocking
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+#### Description
+
+This project implemented three different algorithms for flocking behavior in CUDA: brute-force, uniform grid, coherent uniform grid.
+
+![alt text](https://github.com/trungtle/Project1-CUDA-Flocking/blob/master/images/screenshots/Uniform_5000_boids.gif "Flocking simulation")
+
+**---- General information for CUDA device ----**
+- Device name: GeForce GTX 980 Ti
+- Compute capability: 5.2
+- Compute mode: Default
+- Clock rate: 1076000
+- Integrated: 0
+- Device copy overlap: Enabled
+- Kernel execution timeout: Enabled
+
+**---- Memory information for CUDA device ----**
+
+- Total global memory: 6442450944
+- Total constant memory: 65536
+- Multiprocessor count: 22
+- Shared memory per multiprocessor: 98304
+- Registers per multiprocessor: 65536
+- Max threads per multiprocessor: 2048
+- Max grid dimensions: [2147483647, 65535, 65535]
+- Max threads per block: 1024
+- Max registers per block: 65536
+- Max thread dimensions: [1024, 1024, 64]
+- Threads per block: 512
+
+
+#### Performance analysis
+
+To run performance analysis, I used CUDA to wrap around the simulation step and turned off visualization.
+
+The boid count tested are: 1000, 5000, 50000, 100000, 150000
+The block size tested are: 1, 128, 512 (maxed at 512 threads per block on my machine)
+
+**Table showing simulation time per frame (in ms) vs. number of boids). Block size is 128.**
+
+| Boid count |	Coherent	| Uniform |	Naive|
+| ---------- | -------- | ------- | ---- |
+|5000|	0.2|	0.2|	4.3|
+|50000|	1.2|	1.5|	166|
+|100000|	1.8|	2.6|	654|
+|500000|	12.5|	20.6|	(crash)|
+|1000000|	52|	74|	(crash)|
+
+![alt text](https://github.com/trungtle/Project1-CUDA-Flocking/blob/master/images/charts/performace.png "Naive vs. Coherent vs scattered uniform grid performance")
+
+For each implmentation, as the number of boids increases, there is a big drop in performance. This is due to the fact that for each boid's velocity computation, we need loop through a list of potential neighbors that can affect the boid. The biggest optimization made in each implementation is a improvement on how efficient we can iterate through these neighbors by partitioning them into a uniform grid and rearrange their data in a coherent memory.
+
+Changing the block size doesn't seem to increases performace. However, I did find an interesting pattern. At every block size that is a power of 2, the performance is optimal. I tested this with 1,000,000 in coherent grid (~46ms max) and scattered uniform grid (~74ms max).
+
+**Table showing block size vs. simulation time per frame (in ms)**
+
+| Block size |	Coherent |	Uniform |
+| ---------- | -------- | ------- |
+|32|	45|	74|
+|35|	74|	113|
+|64|	44|	73|
+|100|	54|	87|
+|128|	46|	74|
+|256|	45|	73|
+|400|	54|	85|
+|512|	46|	73|
+|600|	68|	105|
+|1024|	46|	72|
+
+When comparing between coherent and scattered uniform grids, I did see a great performance improvement with the coherent data implementation as the number of boids increases. From my performance analysis, I observed that at 1000000 boids and kernel block size 128, coherent uniform grid is ~20ms faster than scattered uniform grid. This is a big improvement! This is due to the fact that we take advantage of spatial coherence can access each neighboring boid's data sequentially in memory.
+
+**Note**: I also updated the -arch=sm_52 version to make it compatible for my machine
diff --git a/images/charts/performace.png b/images/charts/performace.png
diff --git a/images/screenshots/Uniform_5000_boids.gif b/images/screenshots/Uniform_5000_boids.gif
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_52
     )