CIS565-Fall-2016 · itoupeter · Sep 6, 2016 · Sep 9, 2016 · Sep 10, 2016 · Sep 12, 2016
diff --git a/README.md b/README.md
@@ -1,10 +1,50 @@
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
-Project 1 - Flocking**
+# University of Pennsylvania, CIS 565: GPU Programming and Architecture
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+## Project 1 - Flocking
+* Liang Peng
+* Tested on: Windows 10, i7-6700HQ @ 2.6GHz, 8GB, GTX 960M (Personal Computer)
 
-### (TODO: Your README)
+## Screenshots
+* Rendering
+<br><img src="images/flocking.gif" width="500">
+* Profiling
+<br><img src="images/profiling.PNG" width="500">
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+## Performance Analysis
+
+### With Visualization
+
+Algorithm | Max Boid Count | Framerate (FPS)
+:---:|:---:|:---:
+Brute-Force | 5,000 | 60
+Scattered Uniform Grid | 55,000 | 60
+Coherent Uniform Grid | 120,000 | 60
+
+### Without Visualization
+
+Algorithm | Boid Count | Framerate (FPS)
+:---:|:---:|:---:
+Brute-Force | 5,000 | 72
+Scattered Uniform Grid | 5,000 | 590
+Coherent Uniform Grid | 5,000 | 640
+
+### Block Size
+
+Boid Count | Block Size | Framerate (FPS)
+:---:|:---:|:---:
+50,000 | 16 | 109
+50,000 | 32 | 170
+50,000 | 64 | 182
+50,000 | 128 | 180
+50,000 | 256 | 170
+50,000 | 1024 | 170
+
+### Conclusion
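+* With visualization on, the framerate holds at 60 FPS while the boid count is small. Past an algorithm-dependent threshold (about 5,000 boids for brute force, 55,000 for the scattered uniform grid, and 120,000 for the coherent uniform grid), it drops below 60 FPS.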
+
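+* The algorithm used to update boid positions and velocities has a large influence on simulation performance.
+  * From part 1.2 (brute force) to part 2.1 (scattered uniform grid), neighbor search becomes far more efficient because grid index information limits each boid's search to nearby cells instead of every other boid. At 5,000 boids without visualization, the framerate rises from 72 to 590 FPS.
+  * From part 2.1 to part 2.3 (coherent uniform grid), performance improves further (590 to 640 FPS at 5,000 boids) because grouping the data accessed by neighboring threads raises the cache hit rate.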
+
+* Block size and block count have some impact on performance. As block size increases and block count decreases, the framerate first rises and then falls: at 50,000 boids it climbs from 109 FPS at block size 16 to a peak of 182 FPS at block size 64, then settles at 170 FPS for block sizes 256 and 1024.
+  * _My speculation:_ If the block size is too small, the block count will be large. Since each block is processed by one streaming multiprocessor (SM) and the number of SMs is limited, more scheduling rounds are needed to handle all blocks; a block of 16 threads also fills only half of a 32-thread warp. If the block size is too large, the threads of a block compete for the limited cache capacity available to that block, so cached data is replaced more often, which lowers the cache hit rate and hurts performance.
diff --git a/images/flocking.gif b/images/flocking.gif
diff --git a/images/profiling.PNG b/images/profiling.PNG
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_50
     )