CIS565-Fall-2018 · IshanRanade · Sep 9, 2018
diff --git a/README.md b/README.md
@@ -1,11 +1,37 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Ishan Ranade
+* Tested on personal computer: Gigabyte Aero 14, Windows 10, i7-7700HQ, GTX 1060
 
-### (TODO: Your README)
+# Boid Simulation
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+![](demo2.gif)
+
+## Features
+
+- Naive implementation
+- Uniform grid without coherence
+- Uniform grid with coherence
+- Performance analysis with various blocksizes and boid counts
+
+## Performance Analysis
+
+The graphs below show the performance of the three implemented algorithms at blocksizes of 128, 256, and 512.
+
+![](blocksize128.JPG)
+
+![](blocksize256.JPG)
+
+![](blocksize512.JPG)
+
+
+For this analysis I recorded the framerate of the simulation for each of the three algorithms while varying both the blocksize and the number of boids (the raw numbers are in data.txt).  I used FPS as the metric for deciding which combination performed best.
+
+The naive method served as the baseline, and the goal was for each of the other two methods to beat it.  This is a good test of whether the uniform grid actually produces better results.
+
+For the naive method, increasing the number of boids clearly reduced performance, because every boid checks every other boid and the work therefore grows quadratically with the boid count.  For the uniform method, performance also dropped as the number of boids increased, with the exception of blocksize 128, where 10000 boids ran faster than 5000.  The drop may be due to poor cache use when looping through the boids in the surrounding cells, since their data is scattered in memory.  The coherent method handled increasing boid counts best, which may be because boids in the same cell are contiguous in memory and so cache better.
+
+In general the naive method produced the worst results, especially as the number of boids increased.  The blocksize had little effect on this method, probably because the loop over all other boids dominates the runtime regardless of how the threads are grouped into blocks.
+
+At 1000 boids the uniform method did better than the coherent method except when the blocksize was 512.  The coherent grid showed the best performance as the boid count increased, and at 10000 boids it was the fastest at every blocksize.  I was surprised, though, that at low blocksizes and boid counts it did worse than the uniform method, only pulling ahead as the blocksize or boid count increased.
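
The difference between the naive and grid-based methods discussed in the README comes down to how many candidate neighbors each boid has to examine. The following is a minimal CPU-side C++ sketch of that count, not the project's CUDA kernels. The names `makeBoids`, `naiveChecks`, `cellCoord`, and `gridChecks`, the cubic world of side `worldSize`, and the 3x3x3 cell neighborhood are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Boid { float x, y, z; };

// Hypothetical helper: deterministic pseudo-random boids in [0, worldSize)^3,
// generated with a small linear congruential generator.
std::vector<Boid> makeBoids(int n, float worldSize, std::uint32_t seed = 1u) {
    std::uint32_t state = seed;
    auto next = [&state, worldSize]() {
        state = state * 1664525u + 1013904223u;
        return static_cast<float>(state >> 8) * (1.0f / 16777216.0f) * worldSize;
    };
    std::vector<Boid> boids(n);
    for (Boid& b : boids) {
        b.x = next();
        b.y = next();
        b.z = next();
    }
    return boids;
}

// Naive search: every boid examines every other boid.
long long naiveChecks(const std::vector<Boid>& boids) {
    const long long n = static_cast<long long>(boids.size());
    return n * (n - 1);
}

// Clamp a position to a valid cell coordinate along one axis.
int cellCoord(float v, float cellWidth, int gridRes) {
    int c = static_cast<int>(v / cellWidth);
    if (c < 0) c = 0;
    if (c >= gridRes) c = gridRes - 1;
    return c;
}

// Grid search: every boid examines only the boids binned into the
// 3x3x3 block of cells around its own cell (an assumed neighborhood size).
long long gridChecks(const std::vector<Boid>& boids, float worldSize, float cellWidth) {
    assert(worldSize > 0.0f && cellWidth > 0.0f);
    const int gridRes = static_cast<int>(std::ceil(worldSize / cellWidth));
    auto cellIndex = [gridRes](int x, int y, int z) {
        return x + gridRes * (y + gridRes * z);
    };

    // Count how many boids fall into each cell.
    std::vector<int> cellCount(static_cast<std::size_t>(gridRes) * gridRes * gridRes, 0);
    for (const Boid& b : boids) {
        ++cellCount[cellIndex(cellCoord(b.x, cellWidth, gridRes),
                              cellCoord(b.y, cellWidth, gridRes),
                              cellCoord(b.z, cellWidth, gridRes))];
    }

    // Sum the candidates each boid would have to examine.
    long long checks = 0;
    for (const Boid& b : boids) {
        const int cx = cellCoord(b.x, cellWidth, gridRes);
        const int cy = cellCoord(b.y, cellWidth, gridRes);
        const int cz = cellCoord(b.z, cellWidth, gridRes);
        for (int dz = -1; dz <= 1; ++dz) {
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int x = cx + dx, y = cy + dy, z = cz + dz;
                    if (x < 0 || x >= gridRes || y < 0 || y >= gridRes ||
                        z < 0 || z >= gridRes) {
                        continue;  // neighbor cell lies outside the grid
                    }
                    checks += cellCount[cellIndex(x, y, z)];
                }
            }
        }
        checks -= 1;  // do not count the boid itself
    }
    return checks;
}
```

Calling `gridChecks(makeBoids(10000, 100.0f), 100.0f, 10.0f)` examines far fewer candidates than the 10000 × 9999 pairs reported by `naiveChecks`, which is consistent with the gap between the naive and grid-based results at 10000 boids. When the cell width equals the world size, the grid degenerates to a single cell and both counts are equal.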
diff --git a/blocksize128.JPG b/blocksize128.JPG
diff --git a/blocksize256.JPG b/blocksize256.JPG
diff --git a/blocksize512.JPG b/blocksize512.JPG
diff --git a/data.txt b/data.txt
@@ -0,0 +1,41 @@
+Coherent
+	Boids 1000
+		Blocksize 128 - 450
+		Blocksize 256 - 469
+		Blocksize 512 - 564
+	Boids 5000
+		Blocksize 128 - 398
+		Blocksize 256 - 395
+		Blocksize 512 - 396
+	Boids 10000
+		Blocksize 128 - 485
+		Blocksize 256 - 378
+		Blocksize 512 - 320
+
+Uniform
+	Boids 1000
+		Blocksize 128 - 555
+		Blocksize 256 - 553
+		Blocksize 512 - 496
+	Boids 5000
+		Blocksize 128 - 382
+		Blocksize 256 - 404
+		Blocksize 512 - 392
+	Boids 10000
+		Blocksize 128 - 444
+		Blocksize 256 - 277
+		Blocksize 512 - 242
+
+Naive
+	Boids 1000
+		Blocksize 128 - 460
+		Blocksize 256 - 509
+		Blocksize 512 - 464
+	Boids 5000
+		Blocksize 128 - 253
+		Blocksize 256 - 240
+		Blocksize 512 - 242
+	Boids 10000
+		Blocksize 128 - 121
+		Blocksize 256 - 117
+		Blocksize 512 - 114
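
FPS is an inverse measure, so the numbers above can be easier to compare as time per frame or as a ratio against the naive baseline. A minimal C++ sketch of both conversions follows. The function names `frameTimeMs` and `speedup` are hypothetical, and the resulting frame time covers everything done per frame, not only the simulation kernels.

```cpp
#include <cassert>

// Convert a frame rate in frames per second to milliseconds per frame.
double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Speedup of a method over a baseline measured at the same settings,
// expressed as the ratio of their frame rates.
double speedup(double fps, double baselineFps) {
    assert(fps > 0.0 && baselineFps > 0.0);
    return fps / baselineFps;
}
```

For example, at 10000 boids and blocksize 128 the naive method's 121 FPS corresponds to about 8.26 ms per frame and the coherent method's 485 FPS to about 2.06 ms, a speedup of roughly 4.0x.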
diff --git a/data.xlsx b/data.xlsx
diff --git a/demo.gif b/demo.gif
diff --git a/demo2.gif b/demo2.gif
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_60
     )