diff --git a/Performance Data.xlsx b/Performance Data.xlsx new file mode 100644 index 0000000..138d8af Binary files /dev/null and b/Performance Data.xlsx differ diff --git a/README.md b/README.md index 98dd9a8..948a223 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,156 @@ **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Michael Willett +* Tested on: Windows 10, i5-4690K @ 3.50GHz 8.00GB, GTX 750 Ti 2GB (Personal Computer) -### (TODO: Your README) +## Contents +1. [Introduction](#intro) +2. [Naive Solution](#part1) +3. [Neighborhood Search](#part2) +4. [Performance Analysis](#part3) +5. [Development Process](#part4) +6. [Build Instructions](#appendix) -Include screenshots, analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) + + +## Introduction: Flocking Simulation +This project explores introductory concepts of GPU parallelization methods for simulating the flocking behavior +of simple particles known as boids. Boid motion is based on three rules calculated from nearby particles: + +1. *Cohesion* - Boids move towards the center of mass of nearby boids +2. *Separation* - Boids try to maintain a minimum distance from one another to avoid collisions +3. *Alignment* - Boids in a group try to align their headings with those of the group + +With an appropriately tuned parameter set, these simple rules can lead to surprisingly complex emergent +behavior, very similar to how schools of fish or flocks of birds move in nature, as seen below.
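The three rules above can be sketched as a single serial pass over all boids. This is a minimal CPU reference version, not the project's CUDA kernel: the rule radii match the benchmark values quoted in Section 3 (5/3/5 units), but the scale factors here are illustrative placeholders.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

// Rule radii follow the Section 3 benchmarks; scale factors are illustrative.
const float rule1Distance = 5.0f, rule1Scale = 0.01f; // cohesion
const float rule2Distance = 3.0f, rule2Scale = 0.1f;  // separation
const float rule3Distance = 5.0f, rule3Scale = 0.1f;  // alignment

// Velocity change for boid `self` from all three rules.
Vec3 computeVelocityChange(int self,
                           const std::vector<Vec3>& pos,
                           const std::vector<Vec3>& vel) {
    Vec3 center{0, 0, 0}, separate{0, 0, 0}, align{0, 0, 0};
    int neighbors = 0;
    for (int i = 0; i < (int)pos.size(); i++) {
        if (i == self) continue;
        float dx = pos[i].x - pos[self].x;
        float dy = pos[i].y - pos[self].y;
        float dz = pos[i].z - pos[self].z;
        float d = std::sqrt(dx * dx + dy * dy + dz * dz);
        if (d < rule1Distance) { // cohesion: pull toward local center of mass
            center.x += pos[i].x; center.y += pos[i].y; center.z += pos[i].z;
            neighbors++;
        }
        if (d < rule2Distance) { // separation: push away from close neighbors
            separate.x -= dx; separate.y -= dy; separate.z -= dz;
        }
        if (d < rule3Distance) { // alignment: match neighboring velocities
            align.x += vel[i].x; align.y += vel[i].y; align.z += vel[i].z;
        }
    }
    Vec3 dv{0, 0, 0};
    if (neighbors > 0) {
        center.x /= neighbors; center.y /= neighbors; center.z /= neighbors;
        dv.x = (center.x - pos[self].x) * rule1Scale + separate.x * rule2Scale + align.x * rule3Scale;
        dv.y = (center.y - pos[self].y) * rule1Scale + separate.y * rule2Scale + align.y * rule3Scale;
        dv.z = (center.z - pos[self].z) * rule1Scale + separate.z * rule2Scale + align.z * rule3Scale;
    }
    return dv;
}
```

Note that separation accumulates *displacements away* from close neighbors, while cohesion accumulates *positions* and is averaged before being scaled; this asymmetry is what lets the smaller separation radius keep boids from collapsing onto one point.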
+ +![Flocking Simulation with Grid Pruning](images/uniform_5000.gif) +*Boid Flocking Simulation* + + +## Section 1: Naive Solution +The boid flocking simulation is naively calculated by comparing the Euclidean distance from the current +boid to every other boid in the simulation, and checking whether the distance is within the desired range for +the rule being calculated (we use a smaller distance metric for calculating separation, otherwise the boids +never exhibit flocking behavior). + +While computationally this results in the correct behavior, it is wasteful in the number of comparison operations, +since we only apply position and velocity updates if a boid has at least one other particle close to it. For smaller +particle counts, this method can achieve 60 fps on a cheap modern processor, but it scales very poorly as the number of +comparisons increases. Detailed analysis is available in [Section 3: Performance Analysis](#part3). + + +## Section 2: Neighborhood Search +To improve on the performance of the naive solution, the physical volume can be discretized into a uniform grid, with each +voxel index linking to a list of boids contained inside that cell. This approach includes some additional computational +overhead to sort each boid into its correct bin, but it limits the search for neighboring boids to only those +contained in nearby cells. If the discretization is set such that the cell edge length is twice the effective distance +for velocity updates, the neighborhood search is limited to boids contained in at most eight cells. + +In particular, this approach has two major additional steps, both of which are easily unrolled into independent parallel +processes. The first step is to assign each boid to a cell, which is just a computation based on its +current position and the grid resolution, so overhead is minimal. + +The second major step is to sort all boids by their cell index and compute a list of all boids within each cell.
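The cell assignment in the first step amounts to quantizing each position by the cell width and flattening the 3D cell coordinate to a 1D index. A minimal sketch, with an x-fastest layout and illustrative parameter names (the project's grid minimum and resolution are computed from the scene scale):

```cpp
// Flatten a 3D cell coordinate to a 1D index (x varies fastest).
int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
    return x + y * gridResolution + z * gridResolution * gridResolution;
}

// Assign a position to its grid cell. gridMin* is the minimum corner of the
// simulation volume; cellWidth is the voxel edge length.
int cellIndexFor(float px, float py, float pz,
                 float gridMinX, float gridMinY, float gridMinZ,
                 float cellWidth, int gridResolution) {
    int cx = (int)((px - gridMinX) / cellWidth);
    int cy = (int)((py - gridMinY) / cellWidth);
    int cz = (int)((pz - gridMinZ) / cellWidth);
    return gridIndex3Dto1D(cx, cy, cz, gridResolution);
}
```

Because each boid's cell index depends only on its own position, this step parallelizes trivially with one thread per boid.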
CUDA's Thrust library +fortunately provides a very fast unstable sort operation. Once the boids are sorted by cell index, the boids belonging to each cell +can be identified in parallel solely by finding the transitions between cell indices in the sorted array. + +[Section 3: Performance Analysis](#part3) outlines the significant performance improvements over the naive solution. It also +shows a small improvement from adding an additional overhead step that sorts the position and velocity arrays directly, rather +than doing an array lookup for each update step. The idea behind this improvement is to remove the array lookup time required +when looping over all the boids in a cell within a single kernel instance, and to address the position/velocity vectors directly. +This extra step is referred to as the coherent array, while the unsorted position/velocity implementation will be called the +scattered array. + + +## Section 3: Performance Analysis +*All benchmarks were performed with a cohesion radius of 5 units, separation radius of 3 units, and alignment radius of 5 units. +Performance was tested relative to the number of boids in a cube with edge length 100 units.* + +Reducing the search space with a uniform grid shows significant improvements over the naive solution. Experiments show that +the naive implementation scales quadratically, as expected since its complexity is O(n^2). The uniform grid implementation shows much +better scaling as the number of boids increases, with the most significant improvement at 30,000 boids, the largest +count tested. The additional position/velocity sorting shows a 20% to 30% improvement in tests with significantly more boids +than cells. + +![Performance of Naive vs. Discrete Grid Implementations](images/performance.png) + + +## Section 4: Development Process +Development of the coherent grid surfaced several bugs. The first implementation attempt sorted +both the position and velocity arrays directly, with replacement on the original array.
This in-place sort produced an interesting artifact +where the general boid flocking behavior was mostly intact in two dimensions, but the boids seemed to align along the third +(shown below). + +![Buffer Overwrite Failure](images/uniform_5000_fail1.gif) +*Failure From Overwriting the Initial Position/Velocity Arrays* + +I originally determined that the cause was sorting the position and velocity frames prior to binning the boids +into their respective cells. Switching the order of these steps at first glance seemed to solve the problem. However, after +observing the behavior for longer time periods, all boids developed the same uniform velocity regardless of their +distance to neighboring flocks. + +It was finally determined that the root cause was that in the kernel invocation, the boid being tested must come from the +unsorted array, while the boids the current index is tested against must come from the sorted arrays. This final +implementation results in up to a 20% improvement depending on the density of the boids in the simulation. + + +## Appendix: Build Instructions + +* `src/` contains the source code. +* `external/` contains the binaries and headers for GLEW and GLFW. + +**CMake note:** Do not change any build settings or add any files to your +project directly (in Visual Studio, Nsight, etc.) Instead, edit the +`src/CMakeLists.txt` file. Any files you add must be added here. If you edit it, +just rebuild your VS/Nsight project to make it update itself. + +#### Windows + +1. In Git Bash, navigate to your cloned project directory. +2. Create a `build` directory: `mkdir build` + * (This "out-of-source" build makes it easy to delete the `build` directory + and try again if something goes wrong with the configuration.) +3. Navigate into that directory: `cd build` +4. Open the CMake GUI to configure the project: + * `cmake-gui ..` or `"C:\Program Files (x86)\cmake\bin\cmake-gui.exe" ..` + * Don't forget the `..` part!
+ * Make sure that the "Source" directory is like + `.../Project1-CUDA-Flocking`. + * Click *Configure*. Select your version of Visual Studio, Win64. + (**NOTE:** you must use Win64, as we don't provide libraries for Win32.) + * If you see an error like `CUDA_SDK_ROOT_DIR-NOTFOUND`, + set `CUDA_SDK_ROOT_DIR` to your CUDA install path. This will be something + like: `C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v7.5` + * Click *Generate*. +5. If generation was successful, there should now be a Visual Studio solution + (`.sln`) file in the `build` directory that you just created. Open this. + (from the command line: `explorer *.sln`) +6. Build. (Note that there are Debug and Release configuration options.) +7. Run. Make sure you run the `cis565_` target (not `ALL_BUILD`) by + right-clicking it and selecting "Set as StartUp Project". + * If you have switchable graphics (NVIDIA Optimus), you may need to force + your program to run with only the NVIDIA card. In NVIDIA Control Panel, + under "Manage 3D Settings," set "Multi-display/Mixed GPU acceleration" + to "Single display performance mode". + +#### OS X & Linux + +It is recommended that you use Nsight. + +1. Open Nsight. Set the workspace to the one *containing* your cloned repo. +2. *File->Import...->General->Existing Projects Into Workspace*. + * Select the Project 0 repository as the *root directory*. +3. Select the *cis565-* project in the Project Explorer. From the *Project* + menu, select *Build All*. + * For later use, note that you can select various Debug and Release build + configurations under *Project->Build Configurations->Set Active...*. +4. If you see an error like `CUDA_SDK_ROOT_DIR-NOTFOUND`: + * In a terminal, navigate to the build directory, then run: `cmake-gui ..` + * Set `CUDA_SDK_ROOT_DIR` to your CUDA install path. + This will be something like: `/usr/local/cuda` + * Click *Configure*, then *Generate*. +5. Right click and *Refresh* the project. +6. From the *Run* menu, *Run*. 
Select "Local C/C++ Application" and the + `cis565_` binary. diff --git a/images/coherent_5000.gif b/images/coherent_5000.gif new file mode 100644 index 0000000..4f5d9c2 Binary files /dev/null and b/images/coherent_5000.gif differ diff --git a/images/naive_5000.gif b/images/naive_5000.gif new file mode 100644 index 0000000..1b40585 Binary files /dev/null and b/images/naive_5000.gif differ diff --git a/images/performance.png b/images/performance.png new file mode 100644 index 0000000..6ace735 Binary files /dev/null and b/images/performance.png differ diff --git a/images/uniform_5000.gif b/images/uniform_5000.gif new file mode 100644 index 0000000..049ae12 Binary files /dev/null and b/images/uniform_5000.gif differ diff --git a/images/uniform_5000_fail1.gif b/images/uniform_5000_fail1.gif new file mode 100644 index 0000000..221a1e1 Binary files /dev/null and b/images/uniform_5000_fail1.gif differ diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt index fdd636d..dff0113 100644 --- a/src/CMakeLists.txt +++ b/src/CMakeLists.txt @@ -10,5 +10,5 @@ set(SOURCE_FILES cuda_add_library(src ${SOURCE_FILES} - OPTIONS -arch=sm_20 + OPTIONS -arch=sm_50 ) diff --git a/src/kernel.cu b/src/kernel.cu index 30356b9..fd4416a 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -5,6 +5,7 @@ #include #include "utilityCore.hpp" #include "kernel.h" +#include "device_launch_parameters.h" // LOOK-2.1 potentially useful for doing grid-based neighbor search #ifndef imax @@ -15,6 +16,14 @@ #define imin( a, b ) ( ((a) < (b)) ? (a) : (b) ) #endif +#ifndef clamp +#define clamp(x, lo, hi) (x < lo) ? lo : (x > hi) ? hi : x +#endif + +#ifndef wrap +#define wrap(x, lo, hi) (x < lo) ? x + (hi - lo) : (x > hi) ? x - (hi - lo) : x +#endif + #define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__) /** @@ -54,6 +63,8 @@ void checkCUDAError(const char *msg, int line = -1) { /*! Size of the starting area in simulation space. 
*/ #define scene_scale 100.0f +//#define gridFactor 2.0f + /*********************************************** * Kernel state (pointers are device pointers) * ***********************************************/ @@ -85,6 +96,8 @@ int *dev_gridCellEndIndices; // to this cell? // TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. +glm::vec3 *dev_pos_sorted; +glm::vec3 *dev_vel_sorted; // LOOK-2.1 - Grid parameters based on simulation parameters. // These are automatically computed for you in Boids::initSimulation @@ -147,17 +160,17 @@ void Boids::initSimulation(int N) { cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3)); checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!"); - + cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3)); checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!"); - + // LOOK-1.2 - This is a typical CUDA kernel invocation. kernGenerateRandomPosArray<<>>(1, numObjects, dev_pos, scene_scale); checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); // LOOK-2.1 computing grid params - gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); + gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); int halfSideCount = (int)(scene_scale / gridCellWidth) + 1; gridSideCount = 2 * halfSideCount; @@ -169,6 +182,24 @@ void Boids::initSimulation(int N) { gridMinimum.z -= halfGridWidth; // TODO-2.1 TODO-2.3 - Allocate additional buffers here. 
+ cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!"); + + cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!"); + + cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!"); + + cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!"); + + cudaMalloc((void**)&dev_pos_sorted, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_pos_sorted failed!"); + + cudaMalloc((void**)&dev_vel_sorted, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_vel_sorted failed!"); + cudaThreadSynchronize(); } @@ -223,6 +254,38 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) * stepSimulation * ******************/ + +/** +* Helper function to compute unscaled velocity update for a single pair of boids +*/ +__device__ int computeVelocityChangePair( + const glm::vec3 &pos1, const glm::vec3 &vel1, + const glm::vec3 &pos2, const glm::vec3 &vel2, + glm::vec3 &cen, glm::vec3 &sep, glm::vec3 &coh ) { + + int n = 0; + glm::vec3 dv; + + // RULE 1: Move to center of mass + float dist = glm::length(pos2 - pos1); + if (dist < rule1Distance) { + cen += pos2; + n++; + } + + // RULE 2: Maintain minimum distance from neighbors + if (dist < rule2Distance) { + sep -= (pos2 - pos1); + } + + // RULE 3: Align Velocities + if (dist < rule3Distance) { + coh += vel2; + } + + return n; +} + /** * LOOK-1.2 You can use this as a helper for kernUpdateVelocityBruteForce. * __device__ code can be called from a __global__ context @@ -230,10 +293,25 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) * in the `pos` and `vel` arrays. 
*/ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { - // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves - // Rule 2: boids try to stay a distance d away from each other - // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); + + glm::vec3 dv = glm::vec3(0.0); + glm::vec3 center = glm::vec3(0.0); + glm::vec3 separate = glm::vec3(0.0); + glm::vec3 cohesion = glm::vec3(0.0); + float nBoids = 0; + + for(int i = 0; i < N; i++) { + if (i != iSelf) { + nBoids += computeVelocityChangePair(pos[iSelf], vel[iSelf], pos[i], vel[i], center, separate, cohesion) ; + } + } + + if (nBoids > 0) { + center /= nBoids; + dv = (center - pos[iSelf]) * rule1Scale + cohesion * rule3Scale + separate * rule2Scale; + } + + return dv; } /** @@ -245,6 +323,18 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, // Compute a new velocity based on pos and vel1 // Clamp the speed // Record the new velocity into vel2. Question: why NOT vel1? + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + glm::vec3 dv = computeVelocityChange(N, index, pos, vel1); + glm::vec3 newVel = glm::vec3(0.0); + + newVel = vel2[index] + dv; + float speed = glm::length(newVel); + vel2[index] = (speed <= maxSpeed) ? newVel : newVel * maxSpeed / speed; } /** @@ -261,13 +351,14 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { thisPos += vel[index] * dt; // Wrap the boids around so we don't lose them - thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x; - thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y; - thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z; + float wrapScale = 1.0f; + thisPos.x = thisPos.x < -scene_scale ? scene_scale * wrapScale : thisPos.x; + thisPos.y = thisPos.y < -scene_scale ? 
scene_scale * wrapScale : thisPos.y; + thisPos.z = thisPos.z < -scene_scale ? scene_scale * wrapScale : thisPos.z; - thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x; - thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y; - thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z; + thisPos.x = thisPos.x > scene_scale ? -scene_scale * wrapScale : thisPos.x; + thisPos.y = thisPos.y > scene_scale ? -scene_scale * wrapScale : thisPos.y; + thisPos.z = thisPos.z > scene_scale ? -scene_scale * wrapScale : thisPos.z; pos[index] = thisPos; } @@ -285,10 +376,22 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { __global__ void kernComputeIndices(int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, glm::vec3 *pos, int *indices, int *gridIndices) { - // TODO-2.1 - // - Label each boid with the index of its grid cell. - // - Set up a parallel array of integer indices as pointers to the actual - // boid data in pos and vel1/vel2 + // TODO-2.1 + // - Label each boid with the index of its grid cell. + // - Set up a parallel array of integer indices as pointers to the actual + // boid data in pos and vel1/vel2 + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + // calculate the index from grid dimensions + glm::vec3 temp = (pos[index] - gridMin) * inverseCellWidth; + gridIndices[index] = gridIndex3Dto1D((int) temp.x, (int) temp.y, (int) temp.z, gridResolution); + + // for sorting a dictionary + indices[index] = index; } // LOOK-2.1 Consider how this could be useful for indicating that a cell @@ -306,8 +409,73 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, // Identify the start point of each cell in the gridIndices array. // This is basically a parallel unrolling of a loop that goes // "this index doesn't match the one before it, must be a new cell!" 
+ + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + int Cell = particleGridIndices[index]; + + if (index <= 0) { + gridCellStartIndices[Cell] = index; + + if (particleGridIndices[index + 1] != Cell) + gridCellEndIndices[Cell] = index; + } else if (index >= (N-1)) { + if (particleGridIndices[index - 1] != Cell) + gridCellStartIndices[Cell] = index; + + gridCellEndIndices[Cell] = index; + } else { + if (particleGridIndices[index - 1] != Cell) + gridCellStartIndices[Cell] = index; + + if (particleGridIndices[index + 1] != Cell) + gridCellEndIndices[Cell] = index; + } + +} + +/** +* Helper function to find neighboring grid cells to search +*/ +__device__ void computeNeighborList(int gridResolution, float cellWidth, glm::vec3 &pos, int *Neighbors ) { + + int gridCount = gridResolution*gridResolution*gridResolution; + + // x-axis (fmodf, not std::fmod: std:: math is not declared __device__) + if (fmodf(pos.x, cellWidth) < cellWidth/2) { + Neighbors[1] = wrap(Neighbors[0] - 1, 0, gridCount); + } else { + Neighbors[1] = wrap(Neighbors[0] + 1, 0, gridCount); + } + + // y-axis + if (fmodf(pos.y, cellWidth) < cellWidth/2) { + Neighbors[2] = wrap(Neighbors[0] - gridResolution, 0, gridCount); + Neighbors[3] = wrap(Neighbors[1] - gridResolution, 0, gridCount); + } else { + Neighbors[2] = wrap(Neighbors[0] + gridResolution, 0, gridCount); + Neighbors[3] = wrap(Neighbors[1] + gridResolution, 0, gridCount); + } + + // z-axis + if (fmodf(pos.z, cellWidth) < cellWidth/2) { + Neighbors[4] = wrap(Neighbors[0] - gridResolution*gridResolution, 0, gridCount); + Neighbors[5] = wrap(Neighbors[1] - gridResolution*gridResolution, 0, gridCount); + Neighbors[6] = wrap(Neighbors[2] - gridResolution*gridResolution, 0, gridCount); + Neighbors[7] = wrap(Neighbors[3] - gridResolution*gridResolution, 0, gridCount); + } else { + Neighbors[4] = wrap(Neighbors[0] + gridResolution*gridResolution, 0, gridCount); + Neighbors[5] = wrap(Neighbors[1] + gridResolution*gridResolution, 0, gridCount); + Neighbors[6] = 
wrap(Neighbors[2] + gridResolution*gridResolution, 0, gridCount); + Neighbors[7] = wrap(Neighbors[3] + gridResolution*gridResolution, 0, gridCount); + } + } + __global__ void kernUpdateVelNeighborSearchScattered( int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, float cellWidth, @@ -322,13 +490,52 @@ __global__ void kernUpdateVelNeighborSearchScattered( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. // - Clamp the speed change before putting the new speed in vel2 + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + // Current Cell Index + glm::vec3 temp = (pos[index] - gridMin) * inverseCellWidth; + int CurrentCell = gridIndex3Dto1D((int) temp.x, (int) temp.y, (int) temp.z, gridResolution); + + // Neighboring Cells + int Neighbors[8] = {CurrentCell}; + computeNeighborList(gridResolution, cellWidth, pos[index], Neighbors); + + // check all boids in neighboring cells + glm::vec3 dv = glm::vec3(0.0); + glm::vec3 center = glm::vec3(0.0); + glm::vec3 separate = glm::vec3(0.0); + glm::vec3 cohesion = glm::vec3(0.0); + float nBoids = 0.0; + + for (int i = 0; i < 8; i++) { + if (gridCellStartIndices[Neighbors[i]] != -1) { + for (int j = gridCellStartIndices[Neighbors[i]]; j <= gridCellEndIndices[Neighbors[i]]; j++) { + nBoids += computeVelocityChangePair(pos[index], vel1[index], pos[particleArrayIndices[j]], vel1[particleArrayIndices[j]], center, separate, cohesion); + } + } + } + + if (nBoids > 0) { + center /= nBoids; + dv = (center - pos[index]) * rule1Scale + cohesion * rule3Scale + separate * rule2Scale; + } + + // calculate new velocity and clamp + glm::vec3 newVel = vel2[index] + dv; + float speed = glm::length(newVel); + vel2[index] = (speed <= maxSpeed) ? 
newVel : newVel * maxSpeed / speed; } __global__ void kernUpdateVelNeighborSearchCoherent( int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, float cellWidth, int *gridCellStartIndices, int *gridCellEndIndices, - glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2, + glm::vec3 *pos_sort, glm::vec3 *vel_sort){ // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, // except with one less level of indirection. // This should expect gridCellStartIndices and gridCellEndIndices to refer @@ -341,6 +548,54 @@ __global__ void kernUpdateVelNeighborSearchCoherent( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. // - Clamp the speed change before putting the new speed in vel2 + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + // Current Cell Index + glm::vec3 temp = (pos[index] - gridMin) * inverseCellWidth; + int CurrentCell = gridIndex3Dto1D((int) temp.x, (int) temp.y, (int) temp.z, gridResolution); + + // Neighboring Cells + int Neighbors[8] = {CurrentCell}; + computeNeighborList(gridResolution, cellWidth, pos[index], Neighbors); + + // check all boids in neighboring cells + glm::vec3 dv = glm::vec3(0.0); + glm::vec3 center = glm::vec3(0.0); + glm::vec3 separate = glm::vec3(0.0); + glm::vec3 cohesion = glm::vec3(0.0); + float nBoids = 0.0; + + for (int i = 0; i < 8; i++) { + if (gridCellStartIndices[Neighbors[i]] != -1) { + for (int j = gridCellStartIndices[Neighbors[i]]; j <= gridCellEndIndices[Neighbors[i]]; j++) { + nBoids += computeVelocityChangePair(pos[index], vel1[index], pos_sort[j], vel_sort[j], center, separate, cohesion); + } + } + } + + if (nBoids > 0) { + center /= nBoids; + dv = (center - pos[index]) * rule1Scale + cohesion * rule3Scale + separate * rule2Scale; + } + + // calculate new velocity and clamp + glm::vec3 newVel = 
vel2[index] + dv; + float speed = glm::length(newVel); + vel2[index] = (speed <= maxSpeed) ? newVel : newVel * maxSpeed / speed; +} + +// Use this to pull velocity and position data out of the Wrapper container +__global__ void kernSortArray(int N, int *indices, glm::vec3 *vec1, glm::vec3 *vec2) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + vec2[index] = vec1[indices[index]]; } /** @@ -349,6 +604,17 @@ __global__ void kernUpdateVelNeighborSearchCoherent( void Boids::stepSimulationNaive(float dt) { // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. // TODO-1.2 ping-pong the velocity buffers + + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + + // Swap Velocity Buffers + glm::vec3 *temp = dev_vel1; + dev_vel1 = dev_vel2; + dev_vel2 = temp; + + // Update Velocities and Positions + kernUpdateVelocityBruteForce<<>>(numObjects, dev_pos, dev_vel1, dev_vel2); + kernUpdatePos<<>>(numObjects, dt, dev_pos, dev_vel2); } void Boids::stepSimulationScatteredGrid(float dt) { @@ -364,6 +630,32 @@ void Boids::stepSimulationScatteredGrid(float dt) { // - Perform velocity updates using neighbor search // - Update positions // - Ping-pong buffers as needed + + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + + // Swap Velocity Buffers + cudaMemcpy(dev_vel1, dev_vel2, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice); + + // Bin the Boids + kernComputeIndices<<>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + + // Sort + thrust::device_ptr dev_thrust_keys(dev_particleGridIndices); + thrust::device_ptr dev_thrust_values(dev_particleArrayIndices); + thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values); + + // Calculate boids in each cell + kernResetIntBuffer << > >(numObjects, dev_gridCellStartIndices, -1); + kernResetIntBuffer << > >(numObjects, 
dev_gridCellEndIndices, -1); + kernIdentifyCellStartEnd << > >(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + + // Update Velocities and Positions + kernUpdateVelNeighborSearchScattered<<>>( + numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, + dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, + dev_pos, dev_vel1, dev_vel2); + + kernUpdatePos<<>>(numObjects, dt, dev_pos, dev_vel2); } void Boids::stepSimulationCoherentGrid(float dt) { @@ -382,6 +674,36 @@ void Boids::stepSimulationCoherentGrid(float dt) { // - Perform velocity updates using neighbor search // - Update positions // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE. + + + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + + // Swap Velocity Buffers + cudaMemcpy(dev_vel1, dev_vel2, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice); + + // Bin the Boids + kernComputeIndices<<>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + + // Sort + thrust::device_ptr dev_thrust_keys(dev_particleGridIndices); + thrust::device_ptr dev_thrust_values(dev_particleArrayIndices); + thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values); + + // Reoder arrays + kernSortArray << > >(numObjects, dev_particleArrayIndices, dev_vel1, dev_vel_sorted); + kernSortArray << > >(numObjects, dev_particleArrayIndices, dev_pos, dev_pos_sorted); + + // Calculate boids in each cell + kernResetIntBuffer << > >(numObjects, dev_gridCellStartIndices, -1); + kernResetIntBuffer << > >(numObjects, dev_gridCellEndIndices, -1); + kernIdentifyCellStartEnd << > >(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + + // Update Velocities and Positions + kernUpdateVelNeighborSearchCoherent<<>>( + numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, + 
dev_gridCellStartIndices, dev_gridCellEndIndices, dev_pos, dev_vel1, dev_vel2, dev_pos_sorted, dev_vel_sorted); + + kernUpdatePos<<>>(numObjects, dt, dev_pos, dev_vel2); } void Boids::endSimulation() { @@ -390,6 +712,12 @@ void Boids::endSimulation() { cudaFree(dev_pos); // TODO-2.1 TODO-2.3 - Free any additional buffers here. + cudaFree(dev_particleArrayIndices); + cudaFree(dev_particleGridIndices); + cudaFree(dev_gridCellStartIndices); + cudaFree(dev_gridCellEndIndices); + cudaFree(dev_pos_sorted); + cudaFree(dev_vel_sorted); } void Boids::unitTest() { diff --git a/src/main.cpp b/src/main.cpp index e416836..ec1a3ea 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -14,8 +14,16 @@ // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID #define VISUALIZE 1 -#define UNIFORM_GRID 0 -#define COHERENT_GRID 0 +#define UNIFORM_GRID 1 +#define COHERENT_GRID 1 + +// fix the fact that VC++ 2010 is not C99 compliant +#if defined(_WIN32) || defined(_WIN64) +#define fmax max +#define fmin min +#pragma warning (disable:4996) +#define snprintf sprintf_s +#endif // LOOK-1.2 - change this to adjust particle count in the simulation const int N_FOR_VIS = 5000; @@ -220,21 +228,35 @@ void initShaders(GLuint * program) { double timebase = 0; int frame = 0; - Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure + float avgFrame = 0.0f; + int totalFrames = 0; + double timebase2 = 0; + + //Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure // your CUDA development setup is ready to go. 
while (!glfwWindowShouldClose(window)) { glfwPollEvents(); frame++; + totalFrames++; double time = glfwGetTime(); + // running average time per 100 frames + avgFrame = .99*avgFrame + .01*(time - timebase2); + timebase2 = time; + if (totalFrames % 100 == 0) { + printf("Average Execution Time: %f\n", avgFrame); + totalFrames = 0; + } + if (time - timebase > 1.0) { fps = frame / (time - timebase); - timebase = time; - frame = 0; + timebase = time; + frame = 0; } + runCUDA(); std::ostringstream ss;