diff --git a/Performance Data.xlsx b/Performance Data.xlsx
new file mode 100644
index 0000000..138d8af
Binary files /dev/null and b/Performance Data.xlsx differ
diff --git a/README.md b/README.md
index 98dd9a8..948a223 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,156 @@
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Michael Willett
+* Tested on: Windows 10, I5-4690k @ 3.50GHz 8.00GB, GTX 750-TI 2GB (Personal Computer)
-### (TODO: Your README)
+## Contents
+1. [Introduction](#introduction-flocking-simulation)
+2. [Naive Solution](#section-1-naive-solution)
+3. [Neighborhood Search](#section-2-neighborhood-search)
+4. [Performance Analysis](#section-3-performance-analysis)
+5. [Development Process](#section-4-development-process)
+6. [Build Instructions](#appendix-build-instructions)
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+
+
+## Introduction: Flocking Simulation
+This project explores introductory concepts of GPU parallelization for simulating the flocking behavior
+of simple particles known as boids. Boid motion is based on three rules calculated from nearby particles:
+
+1. *Cohesion* - Boids move towards the center of mass of nearby boids
+2. *Separation* - Boids try to maintain a minimum distance from one another to avoid collisions
+3. *Alignment* - Boids in a group try to align their headings with the rest of the group
+
+With appropriately tuned parameters, these simple rules can lead to surprisingly complex emergent
+behavior, very similar to how schools of fish or flocks of birds move in nature, as seen below.
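The three rules above can be sketched as a minimal host-side C++ example (illustrative only, not the project's CUDA kernels; the function name, weights, and distances are hypothetical stand-ins for the project's tuning constants):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal host-side sketch of the three flocking rules. Names, weights, and
// distances are illustrative, not the project's actual tuning constants.
struct Vec3 { float x, y, z; };

Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
float length(Vec3 a) { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }

struct Boid { Vec3 pos, vel; };

// Velocity change for boid i from all other boids (the naive O(n^2) form).
Vec3 velocityChange(const std::vector<Boid>& boids, size_t i,
                    float cohesionDist, float separationDist, float alignDist,
                    float w1, float w2, float w3) {
  Vec3 center{0, 0, 0}, separate{0, 0, 0}, align{0, 0, 0};
  int neighbors = 0;
  for (size_t j = 0; j < boids.size(); ++j) {
    if (j == i) continue;
    float d = length(boids[j].pos - boids[i].pos);
    if (d < cohesionDist) { center = center + boids[j].pos; ++neighbors; }         // rule 1
    if (d < separationDist) { separate = separate - (boids[j].pos - boids[i].pos); } // rule 2
    if (d < alignDist) { align = align + boids[j].vel; }                           // rule 3
  }
  Vec3 dv{0, 0, 0};
  if (neighbors > 0) {
    center = center * (1.0f / neighbors);  // perceived center of mass
    dv = (center - boids[i].pos) * w1 + separate * w2 + align * w3;
  }
  return dv;
}
```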
+
+
+*Boid Flocking Simulation*
+
+
+## Section 1: Naive Solution
+The naive boids flocking simulation compares the Euclidean distance from the current
+boid to every other boid in the simulation, and checks whether that distance is within the desired range for
+the rule being calculated (we use a smaller distance for separation; otherwise the boids
+never exhibit flocking behavior).
+
+While this produces the correct behavior, it wastes many comparison operations,
+since position and velocity updates only apply when a boid has at least one other particle close to it. For smaller
+particle counts, this method can reach 60 fps on a cheap modern processor, but it scales very poorly as the number of
+comparisons increases. Detailed analysis is available in [Section 3: Performance Analysis](#section-3-performance-analysis).
+
+
+## Section 2: Neighborhood Search
+To improve on the naive solution, the physical volume can be discretized into a uniform grid, with each
+voxel index linking to a list of the boids contained inside that cell. This approach adds some computational
+overhead to sort each boid into its correct bin, but it limits the search for neighboring boids to only those
+contained in nearby cells. If the discretization is set such that the cell edge length is twice the effective distance
+for velocity updates, the neighborhood search is limited to boids contained in at most eight cells.
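For concreteness, the binning computation the grid relies on can be sketched as a host-side analogue of the project's `kernComputeIndices` and `gridIndex3Dto1D` (the function name here is hypothetical; parameter names follow the code in this diff):

```cpp
#include <cassert>

// Map a position to a flattened 1D cell index on a uniform grid.
// gridMin* is the minimum corner of the simulation volume and
// inverseCellWidth = 1 / cellWidth; x varies fastest in the 1D layout.
int cellIndex1D(float px, float py, float pz,
                float gridMinX, float gridMinY, float gridMinZ,
                float inverseCellWidth, int gridResolution) {
  int x = (int)((px - gridMinX) * inverseCellWidth);
  int y = (int)((py - gridMinY) * inverseCellWidth);
  int z = (int)((pz - gridMinZ) * inverseCellWidth);
  return x + y * gridResolution + z * gridResolution * gridResolution;
}
```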
+
+In particular, this approach has two major additional steps, both of which are easily unrolled into independent parallel
+processes. The first step is to assign each boid to a cell, which is simply a computation based on its
+current position and the grid resolution, so the overhead is minimal.
+
+The second major step is to sort all boids by their cell index and compute a list of the boids within each cell. The
+Thrust library fortunately provides a very fast parallel sort. Once the boids are sorted by cell index, the boids in each cell
+can be grouped in parallel solely by identifying transitions between cells in the sorted array.
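The transition-detection idea can be sketched serially as a CPU analogue of the parallel kernel (buffer names mirror the diff; -1 marking an empty cell matches the `kernResetIntBuffer` calls in this change):

```cpp
#include <cassert>
#include <vector>

// Given boid-to-cell assignments sorted by cell index, record where each
// cell's run of boids starts and ends. -1 marks an empty cell.
void identifyCellStartEnd(const std::vector<int>& sortedCellIds,
                          std::vector<int>& cellStart,
                          std::vector<int>& cellEnd) {
  for (int i = 0; i < (int)sortedCellIds.size(); ++i) {
    int cell = sortedCellIds[i];
    // a run starts where the previous element belongs to a different cell
    if (i == 0 || sortedCellIds[i - 1] != cell) cellStart[cell] = i;
    // and ends where the next element belongs to a different cell
    if (i + 1 == (int)sortedCellIds.size() || sortedCellIds[i + 1] != cell) cellEnd[cell] = i;
  }
}
```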
+
+[Section 3: Performance Analysis](#section-3-performance-analysis) outlines the significant performance improvements over the naive solution. It also
+shows a small further improvement from an extra overhead step that sorts the position and velocity arrays directly, rather
+than doing an array lookup for each update step. The idea behind this improvement is to remove the lookup time required
+when looping over all the boids in a cell within a single kernel instance, and to address the position/velocity vectors directly.
+This variant is referred to as the coherent array, while the unsorted position/velocity implementation is called the
+scattered array.
+
+
+## Section 3: Performance Analysis
+*All benchmarks were performed with a cohesion radius of 5 units, separation radius of 3 units, and alignment radius of 5 units.
+Performance was tested relative to the number of boids in a cube with edge length 100 units.*
+
+Reducing the search space with a uniform grid shows significant improvements over the naive solution. Experiments show that
+the naive implementation scales quadratically, as expected since its complexity is O(n^2). The uniform grid implementation shows much
+better scaling as the number of boids increases, with the most significant improvement at 30,000 boids, the largest
+count tested. The additional position/velocity sorting shows a 20% to 30% improvement in tests with significantly more boids
+than cells.
+
+
+![Performance Comparison](images/performance.png)
+
+## Section 4: Development Process
+Development of the coherent grid surfaced several bugs. The first few implementation attempts sorted
+both the position and velocity arrays directly, replacing the original arrays in place. This produced an interesting artifact
+where the general flocking behavior was mostly intact in two dimensions, but boids appeared to align along the third
+(shown below).
+
+![Failure From Overwriting Initial Position/Velocity Arrays](images/uniform_5000_fail1.gif)
+*Failure From Overwriting Initial Position/Velocity Arrays*
+
+Originally I determined that the cause was sorting the position and velocity arrays prior to binning the boids
+into their respective cells. Switching the order of these steps seems, at first glance, to fix the behavior. However, after
+observing the simulation for longer periods, all boids develop the same uniform velocity regardless of their
+distance to neighboring flocks.
+
+It was finally determined that the root cause was that, in the kernel invocation, the boid being tested must come from the
+unsorted array, while the boids the current index is tested against must come from the sorted arrays. This final
+implementation yields up to a 20% improvement depending on the density of the boids in the simulation.
+
+
+## Appendix: Build Instructions
+
+* `src/` contains the source code.
+* `external/` contains the binaries and headers for GLEW and GLFW.
+
+**CMake note:** Do not change any build settings or add any files to your
project directly (in Visual Studio, Nsight, etc.). Instead, edit the
+`src/CMakeLists.txt` file. Any files you add must be added here. If you edit it,
+just rebuild your VS/Nsight project to make it update itself.
+
+#### Windows
+
+1. In Git Bash, navigate to your cloned project directory.
+2. Create a `build` directory: `mkdir build`
+ * (This "out-of-source" build makes it easy to delete the `build` directory
+ and try again if something goes wrong with the configuration.)
+3. Navigate into that directory: `cd build`
+4. Open the CMake GUI to configure the project:
+ * `cmake-gui ..` or `"C:\Program Files (x86)\cmake\bin\cmake-gui.exe" ..`
+ * Don't forget the `..` part!
+ * Make sure that the "Source" directory is like
+ `.../Project1-CUDA-Flocking`.
+ * Click *Configure*. Select your version of Visual Studio, Win64.
+ (**NOTE:** you must use Win64, as we don't provide libraries for Win32.)
+ * If you see an error like `CUDA_SDK_ROOT_DIR-NOTFOUND`,
+ set `CUDA_SDK_ROOT_DIR` to your CUDA install path. This will be something
+ like: `C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v7.5`
+ * Click *Generate*.
+5. If generation was successful, there should now be a Visual Studio solution
+ (`.sln`) file in the `build` directory that you just created. Open this.
+ (from the command line: `explorer *.sln`)
+6. Build. (Note that there are Debug and Release configuration options.)
+7. Run. Make sure you run the `cis565_` target (not `ALL_BUILD`) by
+ right-clicking it and selecting "Set as StartUp Project".
+ * If you have switchable graphics (NVIDIA Optimus), you may need to force
+ your program to run with only the NVIDIA card. In NVIDIA Control Panel,
+ under "Manage 3D Settings," set "Multi-display/Mixed GPU acceleration"
+ to "Single display performance mode".
+
+#### OS X & Linux
+
+It is recommended that you use Nsight.
+
+1. Open Nsight. Set the workspace to the one *containing* your cloned repo.
+2. *File->Import...->General->Existing Projects Into Workspace*.
+ * Select the Project 0 repository as the *root directory*.
+3. Select the *cis565-* project in the Project Explorer. From the *Project*
+ menu, select *Build All*.
+ * For later use, note that you can select various Debug and Release build
+ configurations under *Project->Build Configurations->Set Active...*.
+4. If you see an error like `CUDA_SDK_ROOT_DIR-NOTFOUND`:
+ * In a terminal, navigate to the build directory, then run: `cmake-gui ..`
+ * Set `CUDA_SDK_ROOT_DIR` to your CUDA install path.
+ This will be something like: `/usr/local/cuda`
+ * Click *Configure*, then *Generate*.
+5. Right click and *Refresh* the project.
+6. From the *Run* menu, *Run*. Select "Local C/C++ Application" and the
+ `cis565_` binary.
diff --git a/images/coherent_5000.gif b/images/coherent_5000.gif
new file mode 100644
index 0000000..4f5d9c2
Binary files /dev/null and b/images/coherent_5000.gif differ
diff --git a/images/naive_5000.gif b/images/naive_5000.gif
new file mode 100644
index 0000000..1b40585
Binary files /dev/null and b/images/naive_5000.gif differ
diff --git a/images/performance.png b/images/performance.png
new file mode 100644
index 0000000..6ace735
Binary files /dev/null and b/images/performance.png differ
diff --git a/images/uniform_5000.gif b/images/uniform_5000.gif
new file mode 100644
index 0000000..049ae12
Binary files /dev/null and b/images/uniform_5000.gif differ
diff --git a/images/uniform_5000_fail1.gif b/images/uniform_5000_fail1.gif
new file mode 100644
index 0000000..221a1e1
Binary files /dev/null and b/images/uniform_5000_fail1.gif differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index fdd636d..dff0113 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES
cuda_add_library(src
${SOURCE_FILES}
- OPTIONS -arch=sm_20
+ OPTIONS -arch=sm_50
)
diff --git a/src/kernel.cu b/src/kernel.cu
index 30356b9..fd4416a 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -5,6 +5,7 @@
#include
#include "utilityCore.hpp"
#include "kernel.h"
+#include "device_launch_parameters.h"
// LOOK-2.1 potentially useful for doing grid-based neighbor search
#ifndef imax
@@ -15,6 +16,14 @@
#define imin( a, b ) ( ((a) < (b)) ? (a) : (b) )
#endif
+#ifndef clamp
+#define clamp(x, lo, hi) (((x) < (lo)) ? (lo) : (((x) > (hi)) ? (hi) : (x)))
+#endif
+
+#ifndef wrap
+#define wrap(x, lo, hi) (((x) < (lo)) ? (x) + ((hi) - (lo)) : (((x) > (hi)) ? (x) - ((hi) - (lo)) : (x)))
+#endif
+
#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__)
/**
@@ -54,6 +63,8 @@ void checkCUDAError(const char *msg, int line = -1) {
/*! Size of the starting area in simulation space. */
#define scene_scale 100.0f
+//#define gridFactor 2.0f
+
/***********************************************
* Kernel state (pointers are device pointers) *
***********************************************/
@@ -85,6 +96,8 @@ int *dev_gridCellEndIndices; // to this cell?
// TODO-2.3 - consider what additional buffers you might need to reshuffle
// the position and velocity data to be coherent within cells.
+glm::vec3 *dev_pos_sorted;
+glm::vec3 *dev_vel_sorted;
// LOOK-2.1 - Grid parameters based on simulation parameters.
// These are automatically computed for you in Boids::initSimulation
@@ -147,17 +160,17 @@ void Boids::initSimulation(int N) {
cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3));
checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!");
-
+
cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3));
checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!");
-
+
// LOOK-1.2 - This is a typical CUDA kernel invocation.
kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects,
dev_pos, scene_scale);
checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
// LOOK-2.1 computing grid params
- gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+ gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
gridSideCount = 2 * halfSideCount;
@@ -169,6 +182,24 @@ void Boids::initSimulation(int N) {
gridMinimum.z -= halfGridWidth;
// TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+ cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+ cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+ cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+ cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+ cudaMalloc((void**)&dev_pos_sorted, N * sizeof(glm::vec3));
+ checkCUDAErrorWithLine("cudaMalloc dev_pos_sorted failed!");
+
+ cudaMalloc((void**)&dev_vel_sorted, N * sizeof(glm::vec3));
+ checkCUDAErrorWithLine("cudaMalloc dev_vel_sorted failed!");
+
cudaThreadSynchronize();
}
@@ -223,6 +254,38 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
* stepSimulation *
******************/
+
+/**
+* Helper function to compute unscaled velocity update for a single pair of boids
+*/
+__device__ int computeVelocityChangePair(
+ const glm::vec3 &pos1, const glm::vec3 &vel1,
+ const glm::vec3 &pos2, const glm::vec3 &vel2,
+ glm::vec3 &cen, glm::vec3 &sep, glm::vec3 &coh ) {
+
+ int n = 0;
+ glm::vec3 dv;
+
+ // RULE 1: Move to center of mass
+ float dist = glm::length(pos2 - pos1);
+ if (dist < rule1Distance) {
+ cen += pos2;
+ n++;
+ }
+
+ // RULE 2: Maintain minimum distance from neighbors
+ if (dist < rule2Distance) {
+ sep -= (pos2 - pos1);
+ }
+
+ // RULE 3: Align Velocities
+ if (dist < rule3Distance) {
+ coh += vel2;
+ }
+
+ return n;
+}
+
/**
* LOOK-1.2 You can use this as a helper for kernUpdateVelocityBruteForce.
* __device__ code can be called from a __global__ context
@@ -230,10 +293,25 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
* in the `pos` and `vel` arrays.
*/
__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
- // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
- // Rule 2: boids try to stay a distance d away from each other
- // Rule 3: boids try to match the speed of surrounding boids
- return glm::vec3(0.0f, 0.0f, 0.0f);
+
+ glm::vec3 dv = glm::vec3(0.0);
+ glm::vec3 center = glm::vec3(0.0);
+ glm::vec3 separate = glm::vec3(0.0);
+ glm::vec3 cohesion = glm::vec3(0.0);
+ float nBoids = 0;
+
+ for(int i = 0; i < N; i++) {
+ if (i != iSelf) {
+ nBoids += computeVelocityChangePair(pos[iSelf], vel[iSelf], pos[i], vel[i], center, separate, cohesion) ;
+ }
+ }
+
+ if (nBoids > 0) {
+ center /= nBoids;
+ dv = (center - pos[iSelf]) * rule1Scale + cohesion * rule3Scale + separate * rule2Scale;
+ }
+
+ return dv;
}
/**
@@ -245,6 +323,18 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
// Compute a new velocity based on pos and vel1
// Clamp the speed
// Record the new velocity into vel2. Question: why NOT vel1?
+
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= N) {
+ return;
+ }
+
+ glm::vec3 dv = computeVelocityChange(N, index, pos, vel1);
+
+ glm::vec3 newVel = vel2[index] + dv;
+ float speed = glm::length(newVel);
+ vel2[index] = (speed <= maxSpeed) ? newVel : newVel * maxSpeed / speed;
}
/**
@@ -261,13 +351,14 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
thisPos += vel[index] * dt;
// Wrap the boids around so we don't lose them
- thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x;
- thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y;
- thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z;
+ float wrapScale = 1.0f;
+ thisPos.x = thisPos.x < -scene_scale ? scene_scale * wrapScale : thisPos.x;
+ thisPos.y = thisPos.y < -scene_scale ? scene_scale * wrapScale : thisPos.y;
+ thisPos.z = thisPos.z < -scene_scale ? scene_scale * wrapScale : thisPos.z;
- thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x;
- thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y;
- thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z;
+ thisPos.x = thisPos.x > scene_scale ? -scene_scale * wrapScale : thisPos.x;
+ thisPos.y = thisPos.y > scene_scale ? -scene_scale * wrapScale : thisPos.y;
+ thisPos.z = thisPos.z > scene_scale ? -scene_scale * wrapScale : thisPos.z;
pos[index] = thisPos;
}
@@ -285,10 +376,22 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
__global__ void kernComputeIndices(int N, int gridResolution,
glm::vec3 gridMin, float inverseCellWidth,
glm::vec3 *pos, int *indices, int *gridIndices) {
- // TODO-2.1
- // - Label each boid with the index of its grid cell.
- // - Set up a parallel array of integer indices as pointers to the actual
- // boid data in pos and vel1/vel2
+ // TODO-2.1
+ // - Label each boid with the index of its grid cell.
+ // - Set up a parallel array of integer indices as pointers to the actual
+ // boid data in pos and vel1/vel2
+
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= N) {
+ return;
+ }
+
+ // calculate the index from grid dimensions
+ glm::vec3 temp = (pos[index] - gridMin) * inverseCellWidth;
+ gridIndices[index] = gridIndex3Dto1D((int) temp.x, (int) temp.y, (int) temp.z, gridResolution);
+
+ // for sorting a dictionary
+ indices[index] = index;
}
// LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -306,8 +409,73 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
// Identify the start point of each cell in the gridIndices array.
// This is basically a parallel unrolling of a loop that goes
// "this index doesn't match the one before it, must be a new cell!"
+
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= N) {
+ return;
+ }
+
+ int Cell = particleGridIndices[index];
+
+ if (index <= 0) {
+ gridCellStartIndices[Cell] = index;
+
+ if (particleGridIndices[index + 1] != Cell)
+ gridCellEndIndices[Cell] = index;
+ } else if (index >= (N-1)) {
+ if (particleGridIndices[index - 1] != Cell)
+ gridCellStartIndices[Cell] = index;
+
+ gridCellEndIndices[Cell] = index;
+ } else {
+ if (particleGridIndices[index - 1] != Cell)
+ gridCellStartIndices[Cell] = index;
+
+ if (particleGridIndices[index + 1] != Cell)
+ gridCellEndIndices[Cell] = index;
+ }
+
+}
+
+/**
+* Helper function to find neighboring grid cells to search
+*/
+__device__ void computeNeighborList(int gridResolution, float cellWidth, glm::vec3 &pos, int *Neighbors ) {
+
+ int gridCount = gridResolution*gridResolution*gridResolution;
+
+ // x-axis
+ if (std::fmod(pos.x, cellWidth) < cellWidth/2) {
+ Neighbors[1] = wrap(Neighbors[0] - 1, 0, gridCount);
+ } else {
+ Neighbors[1] = wrap(Neighbors[0] + 1, 0, gridCount);
+ }
+
+ // y-axis
+ if (std::fmod(pos.y, cellWidth) < cellWidth/2) {
+ Neighbors[2] = wrap(Neighbors[0] - gridResolution, 0, gridCount);
+ Neighbors[3] = wrap(Neighbors[1] - gridResolution, 0, gridCount);
+ } else {
+ Neighbors[2] = wrap(Neighbors[0] + gridResolution, 0, gridCount);
+ Neighbors[3] = wrap(Neighbors[1] + gridResolution, 0, gridCount);
+ }
+
+ // z-axis
+ if (std::fmod(pos.z, cellWidth) < cellWidth/2) {
+ Neighbors[4] = wrap(Neighbors[0] - gridResolution*gridResolution, 0, gridCount);
+ Neighbors[5] = wrap(Neighbors[1] - gridResolution*gridResolution, 0, gridCount);
+ Neighbors[6] = wrap(Neighbors[2] - gridResolution*gridResolution, 0, gridCount);
+ Neighbors[7] = wrap(Neighbors[3] - gridResolution*gridResolution, 0, gridCount);
+ } else {
+ Neighbors[4] = wrap(Neighbors[0] + gridResolution*gridResolution, 0, gridCount);
+ Neighbors[5] = wrap(Neighbors[1] + gridResolution*gridResolution, 0, gridCount);
+ Neighbors[6] = wrap(Neighbors[2] + gridResolution*gridResolution, 0, gridCount);
+ Neighbors[7] = wrap(Neighbors[3] + gridResolution*gridResolution, 0, gridCount);
+ }
+
}
+
__global__ void kernUpdateVelNeighborSearchScattered(
int N, int gridResolution, glm::vec3 gridMin,
float inverseCellWidth, float cellWidth,
@@ -322,13 +490,52 @@ __global__ void kernUpdateVelNeighborSearchScattered(
// - Access each boid in the cell and compute velocity change from
// the boids rules, if this boid is within the neighborhood distance.
// - Clamp the speed change before putting the new speed in vel2
+
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= N) {
+ return;
+ }
+
+ // Current Cell Index
+ glm::vec3 temp = (pos[index] - gridMin) * inverseCellWidth;
+ int CurrentCell = gridIndex3Dto1D((int) temp.x, (int) temp.y, (int) temp.z, gridResolution);
+
+ // Neighboring Cells
+ int Neighbors[8] = {CurrentCell};
+ computeNeighborList(gridResolution, cellWidth, pos[index], Neighbors);
+
+ // check all boids in neighboring cells
+ glm::vec3 dv = glm::vec3(0.0);
+ glm::vec3 center = glm::vec3(0.0);
+ glm::vec3 separate = glm::vec3(0.0);
+ glm::vec3 cohesion = glm::vec3(0.0);
+ float nBoids = 0.0;
+
+ for (int i = 0; i < 8; i++) {
+ if (gridCellStartIndices[Neighbors[i]] != -1) {
+ for (int j = gridCellStartIndices[Neighbors[i]]; j <= gridCellEndIndices[Neighbors[i]]; j++) {
+ nBoids += computeVelocityChangePair(pos[index], vel1[index], pos[particleArrayIndices[j]], vel1[particleArrayIndices[j]], center, separate, cohesion);
+ }
+ }
+ }
+
+ if (nBoids > 0) {
+ center /= nBoids;
+ dv = (center - pos[index]) * rule1Scale + cohesion * rule3Scale + separate * rule2Scale;
+ }
+
+ // calculate new velocity and clamp
+ glm::vec3 newVel = vel2[index] + dv;
+ float speed = glm::length(newVel);
+ vel2[index] = (speed <= maxSpeed) ? newVel : newVel * maxSpeed / speed;
}
__global__ void kernUpdateVelNeighborSearchCoherent(
int N, int gridResolution, glm::vec3 gridMin,
float inverseCellWidth, float cellWidth,
int *gridCellStartIndices, int *gridCellEndIndices,
- glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+ glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2,
+ glm::vec3 *pos_sort, glm::vec3 *vel_sort){
// TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
// except with one less level of indirection.
// This should expect gridCellStartIndices and gridCellEndIndices to refer
@@ -341,6 +548,54 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
// - Access each boid in the cell and compute velocity change from
// the boids rules, if this boid is within the neighborhood distance.
// - Clamp the speed change before putting the new speed in vel2
+
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= N) {
+ return;
+ }
+
+ // Current Cell Index
+ glm::vec3 temp = (pos[index] - gridMin) * inverseCellWidth;
+ int CurrentCell = gridIndex3Dto1D((int) temp.x, (int) temp.y, (int) temp.z, gridResolution);
+
+ // Neighboring Cells
+ int Neighbors[8] = {CurrentCell};
+ computeNeighborList(gridResolution, cellWidth, pos[index], Neighbors);
+
+ // check all boids in neighboring cells
+ glm::vec3 dv = glm::vec3(0.0);
+ glm::vec3 center = glm::vec3(0.0);
+ glm::vec3 separate = glm::vec3(0.0);
+ glm::vec3 cohesion = glm::vec3(0.0);
+ float nBoids = 0.0;
+
+ for (int i = 0; i < 8; i++) {
+ if (gridCellStartIndices[Neighbors[i]] != -1) {
+ for (int j = gridCellStartIndices[Neighbors[i]]; j <= gridCellEndIndices[Neighbors[i]]; j++) {
+ nBoids += computeVelocityChangePair(pos[index], vel1[index], pos_sort[j], vel_sort[j], center, separate, cohesion);
+ }
+ }
+ }
+
+ if (nBoids > 0) {
+ center /= nBoids;
+ dv = (center - pos[index]) * rule1Scale + cohesion * rule3Scale + separate * rule2Scale;
+ }
+
+ // calculate new velocity and clamp
+ glm::vec3 newVel = vel2[index] + dv;
+ float speed = glm::length(newVel);
+ vel2[index] = (speed <= maxSpeed) ? newVel : newVel * maxSpeed / speed;
+}
+
+// Gather position/velocity data into cell-sorted order via the sorted index array
+__global__ void kernSortArray(int N, int *indices, glm::vec3 *vec1, glm::vec3 *vec2) {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= N) {
+ return;
+ }
+
+ vec2[index] = vec1[indices[index]];
}
/**
@@ -349,6 +604,17 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
void Boids::stepSimulationNaive(float dt) {
// TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
// TODO-1.2 ping-pong the velocity buffers
+
+ dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+ // Swap Velocity Buffers
+ glm::vec3 *temp = dev_vel1;
+ dev_vel1 = dev_vel2;
+ dev_vel2 = temp;
+
+ // Update Velocities and Positions
+ kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+ kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
}
void Boids::stepSimulationScatteredGrid(float dt) {
@@ -364,6 +630,32 @@ void Boids::stepSimulationScatteredGrid(float dt) {
// - Perform velocity updates using neighbor search
// - Update positions
// - Ping-pong buffers as needed
+
+ dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+ // Swap Velocity Buffers
+ cudaMemcpy(dev_vel1, dev_vel2, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
+
+ // Bin the Boids
+ kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+ // Sort
+ thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
+ thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
+ thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
+
+ // Calculate boids in each cell
+ kernResetIntBuffer<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_gridCellStartIndices, -1);
+ kernResetIntBuffer<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_gridCellEndIndices, -1);
+ kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+ // Update Velocities and Positions
+ kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(
+ numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+ dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices,
+ dev_pos, dev_vel1, dev_vel2);
+
+ kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
}
void Boids::stepSimulationCoherentGrid(float dt) {
@@ -382,6 +674,36 @@ void Boids::stepSimulationCoherentGrid(float dt) {
// - Perform velocity updates using neighbor search
// - Update positions
// - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+
+
+ dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+ // Swap Velocity Buffers
+ cudaMemcpy(dev_vel1, dev_vel2, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
+
+ // Bin the Boids
+ kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+ // Sort
+ thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
+ thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
+ thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
+
+ // Reorder arrays
+ kernSortArray<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleArrayIndices, dev_vel1, dev_vel_sorted);
+ kernSortArray<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleArrayIndices, dev_pos, dev_pos_sorted);
+
+ // Calculate boids in each cell
+ kernResetIntBuffer<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_gridCellStartIndices, -1);
+ kernResetIntBuffer<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_gridCellEndIndices, -1);
+ kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+ // Update Velocities and Positions
+ kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(
+ numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+ dev_gridCellStartIndices, dev_gridCellEndIndices, dev_pos, dev_vel1, dev_vel2, dev_pos_sorted, dev_vel_sorted);
+
+ kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
}
void Boids::endSimulation() {
@@ -390,6 +712,12 @@ void Boids::endSimulation() {
cudaFree(dev_pos);
// TODO-2.1 TODO-2.3 - Free any additional buffers here.
+ cudaFree(dev_particleArrayIndices);
+ cudaFree(dev_particleGridIndices);
+ cudaFree(dev_gridCellStartIndices);
+ cudaFree(dev_gridCellEndIndices);
+ cudaFree(dev_pos_sorted);
+ cudaFree(dev_vel_sorted);
}
void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index e416836..ec1a3ea 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,8 +14,16 @@
// LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
#define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
+
+// fix the fact that VC++ 2010 is not C99 compliant
+#if defined(_WIN32) || defined(_WIN64)
+#define fmax max
+#define fmin min
+#pragma warning (disable:4996)
+#define snprintf sprintf_s
+#endif
// LOOK-1.2 - change this to adjust particle count in the simulation
const int N_FOR_VIS = 5000;
@@ -220,21 +228,35 @@ void initShaders(GLuint * program) {
double timebase = 0;
int frame = 0;
- Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
+ float avgFrame = 0.0f;
+ int totalFrames = 0;
+ double timebase2 = 0;
+
+ //Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
// your CUDA development setup is ready to go.
while (!glfwWindowShouldClose(window)) {
glfwPollEvents();
frame++;
+ totalFrames++;
double time = glfwGetTime();
+ // running average time per 100 frames
+ avgFrame = .99*avgFrame + .01*(time - timebase2);
+ timebase2 = time;
+ if (totalFrames % 100 == 0) {
+ printf("Average Execution Time: %f\n", avgFrame);
+ totalFrames = 0;
+ }
+
if (time - timebase > 1.0) {
fps = frame / (time - timebase);
- timebase = time;
- frame = 0;
+ timebase = time;
+ frame = 0;
}
+
runCUDA();
std::ostringstream ss;