diff --git a/Performance.pdf b/Performance.pdf
new file mode 100644
index 0000000..207650b
Binary files /dev/null and b/Performance.pdf differ
diff --git a/Performance.xlsx b/Performance.xlsx
new file mode 100644
index 0000000..92d1721
Binary files /dev/null and b/Performance.xlsx differ
diff --git a/Performance_Page_1.png b/Performance_Page_1.png
new file mode 100644
index 0000000..95a119d
Binary files /dev/null and b/Performance_Page_1.png differ
diff --git a/Performance_Page_2.png b/Performance_Page_2.png
new file mode 100644
index 0000000..3de729e
Binary files /dev/null and b/Performance_Page_2.png differ
diff --git a/README.md b/README.md
index 98dd9a8..74e2234 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,52 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+![alt text](https://github.com/lobachevzky/Project1-CUDA-Flocking/blob/working/project1.gif "Running with all possible optimizations")
-### (TODO: Your README)
+* Ethan Brooks
+* Tested on: Windows 7, Intel(R) Xeon(R), GeForce GTX 1070 8GB (SIG Lab)
+
+# Flocking Simulation
+
+## Summary
+
+This project simulates flocking behavior by graphically depicting "boids", or colored points, in a 3D environment that obey three rules:
+
+1. Boids try to fly towards the center of mass of neighboring boids.
+2. Boids try to keep a small distance away from other objects (including other boids).
+3. Boids try to match velocity with nearby boids.
+
+These rules cause groups of boids to coalesce into flocks, with all the boids in a flock flying in parallel, close to one another, at similar velocities.
+
+## Optimizations
+
+This project includes three implementations:
+
+1. 
A naive one that compares each boid to every other boid.
+2. One that only compares boids within neighboring cells.
+3. A further optimization of 2 that minimizes memory accesses.
+
+### Implementation 1
+Though this implementation does utilize the GPU -- a kernel thread is launched for each boid, and all threads execute in parallel -- each thread still compares its assigned boid with every other boid in the simulation. Thus every kernel thread performs O(n) work, where n is the number of boids.
+
+### Implementation 2
+Since all three flocking rules apply only to boids within a certain proximity, it is possible to cut down the number of comparisons by discretizing the 3D space into cells and comparing only boids within the same or adjacent cells. Our implementation maps boids to cells based on their (x, y, z) positions, so, given a boid and its position, we can easily identify the index of the cell in which it resides. We then identify the cells adjacent to this one in all three directions (the cell itself plus its 26 neighbors, 27 in total).
+
+The next task is to identify the boids within each of these 27 cells. A naive approach would scan every boid in the simulation and keep those falling inside the 27 cells, but this would defeat the purpose, since we would be back to the linear scan of Implementation 1. To avoid this, we build a second buffer of pointers to boids, sorted by cell, and then map each cell to a range within this buffer. This way, given a cell, we use the map to find the range of the second buffer to search, and by following the pointers in that range we reach exactly the boids inside the cell.
+
+Our time complexity for each kernel invocation is still O(n), but now n is the number of boids in the neighboring cells rather than in the entire simulation -- a small fraction of the total. 
In practice, as the chart below depicts, the number of boids in neighboring cells remains relatively constant, and execution time increases only slightly as the total number of boids grows.
+
+### Implementation 3
+The previous approach has one major weakness: each cell is mapped to an array of _pointers_ to boids. Therefore, when searching within a cell, we have to follow a pointer for every boid it contains, and since a given boid is likely in the vicinity of multiple boids, we follow the pointer to the same boid repeatedly. On the GPU, memory access is slow. To minimize memory accesses, we instead _sort the boids themselves_ -- that is, we sort the positions and velocities by cell. To do so, we still build a second buffer sorted by cell index, but this time we use that buffer to rearrange the actual position and velocity buffers.
+
+This implementation requires us to re-sort the boids every time step, since boids change position and do not necessarily stay within the same cell. However, sorting is cheap on the GPU (Thrust's `sort_by_key` on integer keys dispatches to a parallel radix sort, whose work is roughly linear in the number of boids), and we perform this step only once per time step, whereas the extra pointer dereferences in Implementation 2 happened on every neighbor lookup.
+
+Another advantage of this approach is that it places adjacent boids close to each other in memory. In Implementation 2, the boid pointers we followed could point anywhere in the position and velocity buffers, and therefore to disparate locations in memory. In contrast, this implementation places the positions and velocities for a given cell next to each other. This speeds up memory access through locality of reference: reads near recently accessed addresses are much more likely to hit the cache, and neighboring threads read coalesced addresses. 
+
+The memory improvements are evident in the following graph:
+
+![alt text](https://github.com/lobachevzky/Project1-CUDA-Flocking/blob/master/Performance_Page_1.png "A comparison of performance across implementations.")
+
+In this graph, time per frame is averaged across 1000 frames.
+
+### Optimization across block sizes
+Finally, we experimented with different block sizes. Each experiment was run with 2^15 boids. The results are shown below:
+
+![alt text](https://github.com/lobachevzky/Project1-CUDA-Flocking/blob/master/Performance_Page_2.png "A comparison of performance across block sizes.")
diff --git a/demo-gif.gif b/demo-gif.gif
new file mode 100644
index 0000000..2fd1278
Binary files /dev/null and b/demo-gif.gif differ
diff --git a/project1.gif b/project1.gif
new file mode 100644
index 0000000..9a0a015
Binary files /dev/null and b/project1.gif differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index fdd636d..eeaabd4 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_52
     )
diff --git a/src/kernel.cu b/src/kernel.cu
index aaf0fbf..caa0208 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -5,6 +5,7 @@
 #include <glm/glm.hpp>
 #include "utilityCore.hpp"
 #include "kernel.h"
+#include <thrust/sort.h>

 // LOOK-2.1 potentially useful for doing grid-based neighbor search
 #ifndef imax
@@ -16,6 +17,20 @@
 #endif

 #define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__)
+#define DEBUG 0
+#define dim 3
+#define maxNumGridSearch (dim * dim * dim)
+#define gridOOB -1
+
+#if DEBUG
+#define debug(...) printf(__VA_ARGS__);
+#define debug0(...) if (index == 0) { printf(__VA_ARGS__); }
+#define debug4000(...) if (index == 4000) { printf(__VA_ARGS__); }
+#else
+#define debug(...) {}
+#define debug0(...) {}
+#define debug4000(...) {}
+#endif

 /**
 * Check for CUDA errors; print and exit if there was a problem.
@@ -47,7 +62,7 @@ void checkCUDAError(const char *msg, int line = -1) { #define rule1Scale 0.01f #define rule2Scale 0.1f -#define rule3Scale 0.1f +#define rule3Scale 0.01f #define maxSpeed 1.0f @@ -85,6 +100,8 @@ int *dev_gridCellEndIndices; // to this cell? // TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. +glm::vec3 *dev_sortedPos; +glm::vec3 *dev_sortedVel; // LOOK-2.1 - Grid parameters based on simulation parameters. // These are automatically computed for you in Boids::initSimulation @@ -98,6 +115,12 @@ glm::vec3 gridMinimum; * initSimulation * ******************/ + +__device__ int thread_index() { + return threadIdx.x + (blockIdx.x * blockDim.x); +} + + __host__ __device__ unsigned int hash(unsigned int a) { a = (a + 0x7ed55d16) + (a << 12); a = (a ^ 0xc761c23c) ^ (a >> 19); @@ -124,7 +147,7 @@ __host__ __device__ glm::vec3 generateRandomVec3(float time, int index) { * CUDA kernel for generating boids with a specified mass randomly around the star. */ __global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, float scale) { - int index = (blockIdx.x * blockDim.x) + threadIdx.x; + int index = thread_index(); if (index < N) { glm::vec3 rand = generateRandomVec3(time, index); arr[index].x = scale * rand.x; @@ -164,15 +187,28 @@ void Boids::initSimulation(int N) { gridCellCount = gridSideCount * gridSideCount * gridSideCount; gridInverseCellWidth = 1.0f / gridCellWidth; float halfGridWidth = gridCellWidth * halfSideCount; - gridMinimum.x -= halfGridWidth; - gridMinimum.y -= halfGridWidth; - gridMinimum.z -= halfGridWidth; - - // TODO-2.1 TODO-2.3 - Allocate additional buffers here. + gridMinimum.x = 0 - halfGridWidth; + gridMinimum.y = 0 - halfGridWidth; + gridMinimum.z = 0 - halfGridWidth; + + //// TODO-2.1 TODO-2.3 - Allocate additional buffers here. 
+ cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!"); + cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!"); + cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!"); + cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!"); + cudaMalloc((void**)&dev_sortedPos, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_sortedPos failed!"); + cudaMalloc((void**)&dev_sortedVel, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_sortedVel failed!"); cudaThreadSynchronize(); } + /****************** * copyBoidsToVBO * ******************/ @@ -219,21 +255,139 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) } + /****************** * stepSimulation * ******************/ + /** * LOOK-1.2 You can use this as a helper for kernUpdateVelocityBruteForce. * __device__ code can be called from a __global__ context * Compute the new velocity on the body with index `iSelf` due to the `N` boids * in the `pos` and `vel` arrays. 
*/ -__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { - // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves - // Rule 2: boids try to stay a distance d away from each other - // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); +__device__ glm::vec3 computeVelocityChange(int start, int end, + int iSelf, const int *particleArrayIndices, const glm::vec3 *pos, const glm::vec3 *vel) { + glm::vec3 thisPos = pos[iSelf]; + + int neighborCount = 0; + + glm::vec3 center(0.0f); + glm::vec3 separate(0.0f); + glm::vec3 cohesion(0.0f); + + for (int i = start; i < end; ++i) { + if (i == iSelf) continue; + + glm::vec3 thatPos = pos[i]; + + float distance = glm::length(thatPos - thisPos); + + // Rule 1: Cohesion: boids fly towards the center of mass of neighboring boids + if (distance < rule1Distance) { + center += thatPos; + neighborCount += 1; + } + + // Rule 2: Separation: boids try to keep a small distance away from each other + if (distance < rule2Distance) { + separate -= thatPos - thisPos; + } + + // Rule 3: Alignment: boids try to match the velocities of neighboring boids + if (distance < rule3Distance) { + cohesion += vel[i]; + } + } + + glm::vec3 toCenter(0.0f); + if (neighborCount > 0) { + center /= neighborCount; + toCenter = (center - thisPos); + } + + return toCenter * rule1Scale + + separate * rule2Scale + + cohesion * rule3Scale; +} + +__device__ glm::vec3 computeVelocityChangeInGrids( + int *gridsToSearch, int *gridStartIndices, int *gridEndIndices, + int start, int end, int index, const int *indexToBoid, + const glm::vec3 *pos, const glm::vec3 *vel) { + + glm::vec3 thisPos = pos[index]; + + int neighborCount = 0; + + glm::vec3 center(0.0f); + glm::vec3 separate(0.0f); + glm::vec3 cohesion(0.0f); + + int compare = 0; + + for (int j = 0; j < maxNumGridSearch; j++) { + int grid = gridsToSearch[j]; + if (grid == gridOOB) 
continue; + + start = gridStartIndices[grid]; + end = gridEndIndices[grid]; + + for (int i = start; i < end; i++) { + compare++; + + int boid = indexToBoid ? indexToBoid[i] : i; + if (boid == index) continue; + + glm::vec3 thatPos = pos[boid]; + float distance = glm::length(thatPos - thisPos); + + // Rule 1: Cohesion: boids fly towards the center of mass of neighboring boids + if (distance < rule1Distance) { + center += thatPos; + neighborCount += 1; + } + + + // Rule 2: Separation: boids try to keep a small distance away from each other + if (distance < rule2Distance) { + separate -= thatPos - thisPos; + } + + // Rule 3: Alignment: boids try to match the velocities of neighboring boids + if (distance < rule3Distance) { + cohesion += vel[boid]; + } + } + } + + glm::vec3 toCenter(0.0f); + if (neighborCount > 0) { + center /= neighborCount; + toCenter = (center - thisPos); + } + + return toCenter * rule1Scale + + separate * rule2Scale + + cohesion * rule3Scale; +} + +__device__ void updateVelocities(glm::vec3 *vel1, glm::vec3 *vel2, + glm::vec3 acceleration, int index) { + glm::vec3 newVel = vel1[index] + acceleration; + // - Clamp the speed change before putting the new speed in vel2 + + float currentSpeed = glm::length(newVel); + float speed = fmin(currentSpeed, maxSpeed); + + glm::vec3 norm = glm::normalize(newVel); + if (newVel.x == 0 && newVel.y == 0 && newVel.z == 0) { + norm = glm::vec3(0.0); + } + + // Record the new velocity into vel2. Question: why NOT vel1? + vel2[index] = norm * speed; } /** @@ -243,8 +397,13 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { // Compute a new velocity based on pos and vel1 - // Clamp the speed - // Record the new velocity into vel2. Question: why NOT vel1? 
+ int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + glm::vec3 acceleration = computeVelocityChange(0, N, index, NULL, pos, vel1); + updateVelocities(vel1, vel2, acceleration, index); } /** @@ -279,49 +438,155 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { // for(y) // for(z)? Or some other order? __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { + if (x < 0 || x >= gridResolution || + y < 0 || y >= gridResolution || + z < 0 || z >= gridResolution) { + return gridOOB; + } return x + y * gridResolution + z * gridResolution * gridResolution; } +__device__ glm::vec3 posToGrid(glm::vec3 pos, glm::vec3 gridMin, float inverseCellWidth, int gridResolution) { + glm::vec3 grid((pos - gridMin) * inverseCellWidth); + grid = glm::floor(grid); + int backOff = gridResolution - 1; + if ((int)grid.x == gridResolution) grid.x = backOff; + if ((int)grid.y == gridResolution) grid.y = backOff; + if ((int)grid.z == gridResolution) grid.z = backOff; + return grid; +} + +__device__ int posToGridIndex(glm::vec3 pos, glm::vec3 gridMin, + float inverseCellWidth, int gridResolution) { + glm::vec3 grid = posToGrid(pos, gridMin, inverseCellWidth, gridResolution); + return gridIndex3Dto1D((int)grid.x, (int)grid.y, (int)grid.z, gridResolution); +} + +// gridResolution: number of grids per side +// gridMin: value of most negative point in the grid __global__ void kernComputeIndices(int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, glm::vec3 *pos, int *indices, int *gridIndices) { // TODO-2.1 // - Label each boid with the index of its grid cell. 
-  //   - Set up a parallel array of integer indices as pointers to the actual
-  //     boid data in pos and vel1/vel2
+  int index = thread_index();
+  if (index < N) {
+    glm::vec3 thisPos = pos[index];
+    gridIndices[index] = posToGridIndex(thisPos, gridMin,
+      inverseCellWidth, gridResolution);
+    indices[index] = index;
+  }
 }

 // LOOK-2.1 Consider how this could be useful for indicating that a cell
 // does not enclose any boids
 __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
-  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int index = thread_index();
   if (index < N) {
     intBuffer[index] = value;
   }
 }

+// Debug-only kernel: touches each buffer so they can be inspected in a debugger.
+__global__ void test(int *boidIndices, int *grids,
+  int *gridCellStartIndices, int *gridCellEndIndices) {
+  int x = boidIndices[0];
+  int y = grids[0];
+  int z = gridCellStartIndices[0];
+  int w = gridCellEndIndices[0];
+  //if (1) {
+  //  printf("test");
+  //}
+}
+
 __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   int *gridCellStartIndices, int *gridCellEndIndices) {
   // TODO-2.1
   // Identify the start point of each cell in the gridIndices array.
-  // This is basically a parallel unrolling of a loop that goes
-  // "this index doesn't match the one before it, must be a new cell!"
+  // TODO: eliminate branches
+  int index = thread_index();
+  if (index >= N) {
+    return;
+  }
+  int thisGrid = particleGridIndices[index];
+  if (index == 0) {
+    gridCellStartIndices[thisGrid] = 0;
+  } else {
+    int prevGrid = particleGridIndices[index - 1];
+    if (thisGrid != prevGrid) {
+      // "this index doesn't match the one before it, must be a new cell!"
+      gridCellStartIndices[thisGrid] = index;
+      gridCellEndIndices[prevGrid] = index;
+    }
+  }
+  if (index == N - 1) {
+    // End indices are exclusive, so the last cell ends at N, not N - 1.
+    gridCellEndIndices[thisGrid] = N;
+  }
+}
+
+// Shared helper for the scattered (2.1) and coherent (2.3) neighbor searches.
+// In the coherent case, particleArrayIndices is NULL: pos and vel are already
+// sorted by cell, so there is one less level of indirection.
+__device__ void updateVelNeighborSearch(
+  int N, int gridResolution, glm::vec3 gridMin,
+  float inverseCellWidth, float cellWidth,
+  int *gridCellStartIndices, int *gridCellEndIndices,
+  int *particleArrayIndices,
+  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+  int index = thread_index();
+  if (index < N) {
+    // - Identify the grid cell that this particle is in
+    glm::vec3 thisPos = pos[index];
+    glm::vec3 thisGrid = posToGrid(thisPos, gridMin, inverseCellWidth, gridResolution);
+
+    // - Identify which cells may contain neighbors. This isn't always 8.
+    int gridsToSearch[maxNumGridSearch];
+    for (int x = -1; x <= 1; x++) {
+      for (int y = -1; y <= 1; y++) {
+        for (int z = -1; z <= 1; z++) {
+          int grid = gridIndex3Dto1D(
+            (int)thisGrid.x + x, (int)thisGrid.y + y, (int)thisGrid.z + z, gridResolution);
+          // map (x, y, z) to a unique int between 0 and maxNumGridSearch - 1
+          int n = gridIndex3Dto1D(x + 1, y + 1, z + 1, dim);
+          gridsToSearch[n] = grid;
+        }
+      }
+    }
+
+    // - For each cell, read the start/end indices in the boid pointer array.
+    // - Access each boid in the cell and compute velocity change from
+    //   the boids rules, if this boid is within the neighborhood distance.
+    glm::vec3 acceleration = computeVelocityChangeInGrids(
+      gridsToSearch, gridCellStartIndices, gridCellEndIndices,
+      0, N, index, particleArrayIndices, pos, vel1);
+    updateVelocities(vel1, vel2, acceleration, index);
+  }
+}
+
 __global__ void kernUpdateVelNeighborSearchScattered(
   int N, int gridResolution, glm::vec3 gridMin,
   float inverseCellWidth, float cellWidth,
   int *gridCellStartIndices, int *gridCellEndIndices,
   int *particleArrayIndices,
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
-  // the number of boids that need to be checked.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+  updateVelNeighborSearch(
+    N, gridResolution, gridMin,
+    inverseCellWidth, cellWidth,
+    gridCellStartIndices, gridCellEndIndices,
+    particleArrayIndices, pos, vel1, vel2);
+}
+
+// Reorder the position and velocity buffers so boids in the same cell are
+// contiguous: sorted index i receives the data of boid particleArrayIndices[i].
+__global__ void kernSortPosVel(int *particleArrayIndices,
+  int N, glm::vec3 *sortedPos, glm::vec3 *sortedVel,
+  glm::vec3 *pos, glm::vec3 *vel) {
+  int index = thread_index();
+  if (index < N) {
+    int oldIndex = particleArrayIndices[index];
+    sortedPos[index] = pos[oldIndex];
+    sortedVel[index] = vel[oldIndex];
+  }
 }

 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -329,18 +594,11 @@
   float inverseCellWidth, float cellWidth,
   int *gridCellStartIndices, int *gridCellEndIndices,
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
-  // except with one less level of indirection.
-  // This should expect gridCellStartIndices and gridCellEndIndices to refer
-  //   directly to pos and vel1.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  //   DIFFERENCE: For best results, consider what order the cells should be
-  //   checked in to maximize the memory benefits of reordering the boids data.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+  updateVelNeighborSearch(
+    N, gridResolution, gridMin,
+    inverseCellWidth, cellWidth,
+    gridCellStartIndices, gridCellEndIndices,
+    NULL, pos, vel1, vel2);
 }

 /**
@@ -348,40 +606,154 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
 */
 void Boids::stepSimulationNaive(float dt) {
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernUpdateVelocityBruteForce << < fullBlocksPerGrid, threadsPerBlock >> >(numObjects, dev_pos, dev_vel1, dev_vel2);
+  kernUpdatePos << < fullBlocksPerGrid, threadsPerBlock >> >(numObjects, dt, dev_pos, dev_vel2);
+  // TODO-1.2 ping-pong the velocity buffers
+  glm::vec3 *swap = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = swap;
 }

 void Boids::stepSimulationScatteredGrid(float dt) {
   // TODO-2.1
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
   // Uniform Grid Neighbor search using Thrust sort.
   // In Parallel:
   // - label each particle with its array index as well as its grid index.
   //   Use 2x width grids.
+  kernComputeIndices << < fullBlocksPerGrid, threadsPerBlock >> > (numObjects,
+    gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos,
+    dev_particleArrayIndices, dev_particleGridIndices);
   // - Unstable key sort using Thrust.
   //   A stable sort isn't necessary, but you are welcome to do a performance comparison.
+
+  ///////////////////////
+  //#define testN 5
+  //int arrayIndices[testN];
+  //int gridIndices[testN];
+  //glm::vec3 pos[testN];

+  //cudaMemcpy(arrayIndices, dev_particleArrayIndices, sizeof(int) * testN, cudaMemcpyDeviceToHost);
+  //cudaMemcpy(gridIndices, dev_particleGridIndices, sizeof(int) * testN, cudaMemcpyDeviceToHost);
+  //cudaMemcpy(pos, dev_pos, sizeof(glm::vec3) * testN, cudaMemcpyDeviceToHost);
+  //checkCUDAErrorWithLine("memcpy back failed!");

+  //std::cout << "before unstable sort: " << std::endl;
+  //for (int i = 0; i < testN; i++) {
+  //  std::cout << " arrayIndex: " << arrayIndices[i];
+  //  std::cout << " gridIndex: " << gridIndices[i];
+  //  std::cout << " pos: " << pos[i].x << " " << pos[i].y << " " << pos[i].z << std::endl;
+  //}
+  ///////////////////////

+  thrust::device_ptr<int> thrust_particleArrayIndices(dev_particleArrayIndices);
+  thrust::device_ptr<int> thrust_particleGridIndices(dev_particleGridIndices);
+  thrust::sort_by_key(
+    thrust_particleGridIndices,
+    thrust_particleGridIndices + numObjects,
+    thrust_particleArrayIndices);

+  ///////////////////////
+  //cudaMemcpy(arrayIndices, dev_particleArrayIndices, sizeof(int) * testN, cudaMemcpyDeviceToHost);
+  //cudaMemcpy(gridIndices, dev_particleGridIndices, sizeof(int) * testN, cudaMemcpyDeviceToHost);
+  //cudaMemcpy(pos, dev_pos, sizeof(glm::vec3) * testN, cudaMemcpyDeviceToHost);
+  //checkCUDAErrorWithLine("memcpy back failed!");

+  //std::cout << "after unstable sort: " << std::endl;
+  //for (int i = 0; i < testN; i++) {
+  //  std::cout << " arrayIndex: " << arrayIndices[i];
+  //  std::cout << " gridIndex: " << gridIndices[i];
+  //  std::cout << " pos: " << pos[i].x << " " << pos[i].y << " " << pos[i].z << std::endl;
+  //}
+  ///////////////////////

   // - Naively unroll the loop for finding the start and end indices of each
   //   cell's data pointers in the array of boid indices
+  dim3 fullGrids((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer << < fullGrids, threadsPerBlock >> >(
+    gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer << < fullGrids, threadsPerBlock >> >(
+    gridCellCount, dev_gridCellEndIndices, -1);
+
+  kernIdentifyCellStartEnd << < fullBlocksPerGrid, threadsPerBlock >> >(
+    numObjects, dev_particleGridIndices,
+    dev_gridCellStartIndices, dev_gridCellEndIndices);
+
   // - Perform velocity updates using neighbor search
+  kernUpdateVelNeighborSearchScattered << < fullBlocksPerGrid, threadsPerBlock >> >(
+    numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, gridCellWidth,
+    dev_gridCellStartIndices, dev_gridCellEndIndices,
+    dev_particleArrayIndices,
+    dev_pos, dev_vel1, dev_vel2);
   // - Update positions
+  kernUpdatePos << < fullBlocksPerGrid, threadsPerBlock >> >(
+    numObjects, dt, dev_pos, dev_vel2);
   // - Ping-pong buffers as needed
+  glm::vec3 *swap = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = swap;
 }

 void Boids::stepSimulationCoherentGrid(float dt) {
   // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
   // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
   // In Parallel:
   // - Label each particle with its array index as well as its grid index.
   //   Use 2x width grids
+  kernComputeIndices << < fullBlocksPerGrid, threadsPerBlock >> > (numObjects,
+    gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos,
+    dev_particleArrayIndices, dev_particleGridIndices);
   // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
   //   are welcome to do a performance comparison.
+
+  thrust::device_ptr<int> thrust_particleArrayIndices(dev_particleArrayIndices);
+  thrust::device_ptr<int> thrust_particleGridIndices(dev_particleGridIndices);
+  thrust::sort_by_key(
+    thrust_particleGridIndices,
+    thrust_particleGridIndices + numObjects,
+    thrust_particleArrayIndices);
+
+  dim3 fullGrids((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer << < fullGrids, threadsPerBlock >> >(
+    gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer << < fullGrids, threadsPerBlock >> >(
+    gridCellCount, dev_gridCellEndIndices, -1);
+  // - Naively unroll the loop for finding the start and end indices of each
+  //   cell's data pointers in the array of boid indices
+  kernIdentifyCellStartEnd << < fullBlocksPerGrid, threadsPerBlock >> >(
+    numObjects, dev_particleGridIndices,
+    dev_gridCellStartIndices, dev_gridCellEndIndices);
   // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
   //   the particle data in the simulation array.
   //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+
+  kernSortPosVel << < fullBlocksPerGrid, threadsPerBlock >> >(
+    dev_particleArrayIndices, numObjects, dev_sortedPos, dev_sortedVel,
+    dev_pos, dev_vel1);
+
   // - Perform velocity updates using neighbor search
+  kernUpdateVelNeighborSearchCoherent << < fullBlocksPerGrid, threadsPerBlock >> >(
+    numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, gridCellWidth,
+    dev_gridCellStartIndices, dev_gridCellEndIndices,
+    dev_sortedPos, dev_sortedVel, dev_vel2);
   // - Update positions
+  kernUpdatePos << < fullBlocksPerGrid, threadsPerBlock >> >(
+    numObjects, dt, dev_sortedPos, dev_vel2);
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  glm::vec3 *swap = dev_pos;
+  dev_pos = dev_sortedPos;
+  dev_sortedPos = swap;
+
+  swap = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = swap;
 }

@@ -390,6 +762,7 @@ void Boids::endSimulation() {
   cudaFree(dev_pos);

   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_sortedPos);
+  cudaFree(dev_sortedVel);
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
 }

 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index e416836..c8cd138 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,18 +14,41 @@
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
+#define UNIFORM_GRID 1
 #define COHERENT_GRID 0
+#define fmin min
+#define fmax max
+#define DEBUG 0

 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
-const float DT = 0.2f;
+const int N_FOR_VIS = 50000;
+const float DT = 0.1f;
+
+//////////////////////
+//glm::vec3 posToGrid(glm::vec3 pos, glm::vec3 gridMin, float inverseCellWidth, int gridResolution) {
+//  glm::vec3 grid((pos - gridMin) * inverseCellWidth);
+//  grid = glm::floor(grid);
+//  if ((int)grid.x == gridResolution) grid.x = 0;
+//  if ((int)grid.y == gridResolution) grid.y = 0;
+//  if ((int)grid.z == gridResolution) grid.z = 0;
+//  return grid;
+//}
+//////////////////////

 /**
 * C main function.
*/ int main(int argc, char* argv[]) { - projectName = "565 CUDA Intro: Boids"; + projectName = "565 CUDA Intro: Boids"; + + //////////// + //glm::vec3 gridMin(0); + //glm::vec3 grid = glm::normalize(gridMin); + //printf("grid: %d, %d, %d", grid.x, grid.y, grid.z); + //int * crash; + //int x = crash[0]; + //////////// if (init(argc, argv)) { mainLoop(); @@ -69,6 +92,7 @@ bool init(int argc, char **argv) { // Window setup stuff glfwSetErrorCallback(errorCallback); +#define checkCUDAMallocError(msg) checkCUDAError(std::string("cudaMalloc ") + msg, __LINE__) if (!glfwInit()) { std::cout @@ -202,7 +226,7 @@ void initShaders(GLuint * program) { #if UNIFORM_GRID && COHERENT_GRID Boids::stepSimulationCoherentGrid(DT); #elif UNIFORM_GRID - Boids::stepSimulationScatteredGrid(DT); + Boids::stepSimulationScatteredGrid(DT); #else Boids::stepSimulationNaive(DT); #endif @@ -258,6 +282,9 @@ void initShaders(GLuint * program) { glfwSwapBuffers(window); #endif + if (DEBUG) { + break; + } } glfwDestroyWindow(window); glfwTerminate();