diff --git a/Performance Data.xlsx b/Performance Data.xlsx new file mode 100644 index 0000000..138d8af Binary files /dev/null and b/Performance Data.xlsx differ diff --git a/README.md b/README.md index 98dd9a8..948a223 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,156 @@ **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Michael Willett +* Tested on: Windows 10, i5-4690K @ 3.50GHz 8.00GB, GTX 750 Ti 2GB (Personal Computer) -### (TODO: Your README) +## Contents +1. [Introduction](#intro) +2. [Naive Solution](#part1) +3. [Neighborhood Search](#part2) +4. [Performance Analysis](#part3) +5. [Development Process](#part4) +6. [Build Instructions](#appendix) -Include screenshots, analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) + + +## Introduction: Flocking Simulation +This project explores introductory concepts of GPU parallelization methods for simulating the flocking behavior +of simple particles known as boids. Boid motion is based on three rules calculated from nearby particles: + +1. *Cohesion* - Boids move towards the center of mass of nearby boids +2. *Separation* - Boids try to maintain a minimum distance from one another to avoid collisions +3. *Alignment* - Boids in a group try to align their headings with those of the group + +With an appropriately tuned parameter set, these simple rules can lead to surprisingly complex emergent +behavior, very similar to how schools of fish or flocks of birds move in nature, as seen below.
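The three rules above can be sketched as a single serial pass over all boids. This is a minimal CPU reference version, not the project's CUDA kernel: the rule radii match the benchmark values quoted in Section 3 (5/3/5 units), but the scale factors here are illustrative placeholders.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

// Rule radii follow the Section 3 benchmarks; scale factors are illustrative.
const float rule1Distance = 5.0f, rule1Scale = 0.01f; // cohesion
const float rule2Distance = 3.0f, rule2Scale = 0.1f;  // separation
const float rule3Distance = 5.0f, rule3Scale = 0.1f;  // alignment

// Velocity change for boid `self` from all three rules.
Vec3 computeVelocityChange(int self,
                           const std::vector<Vec3>& pos,
                           const std::vector<Vec3>& vel) {
    Vec3 center{0, 0, 0}, separate{0, 0, 0}, align{0, 0, 0};
    int neighbors = 0;
    for (int i = 0; i < (int)pos.size(); i++) {
        if (i == self) continue;
        float dx = pos[i].x - pos[self].x;
        float dy = pos[i].y - pos[self].y;
        float dz = pos[i].z - pos[self].z;
        float d = std::sqrt(dx * dx + dy * dy + dz * dz);
        if (d < rule1Distance) { // cohesion: pull toward local center of mass
            center.x += pos[i].x; center.y += pos[i].y; center.z += pos[i].z;
            neighbors++;
        }
        if (d < rule2Distance) { // separation: push away from close neighbors
            separate.x -= dx; separate.y -= dy; separate.z -= dz;
        }
        if (d < rule3Distance) { // alignment: match neighboring velocities
            align.x += vel[i].x; align.y += vel[i].y; align.z += vel[i].z;
        }
    }
    Vec3 dv{0, 0, 0};
    if (neighbors > 0) {
        center.x /= neighbors; center.y /= neighbors; center.z /= neighbors;
        dv.x = (center.x - pos[self].x) * rule1Scale + separate.x * rule2Scale + align.x * rule3Scale;
        dv.y = (center.y - pos[self].y) * rule1Scale + separate.y * rule2Scale + align.y * rule3Scale;
        dv.z = (center.z - pos[self].z) * rule1Scale + separate.z * rule2Scale + align.z * rule3Scale;
    }
    return dv;
}
```

Note that separation accumulates *displacements away* from close neighbors, while cohesion accumulates *positions* and is averaged before being scaled; this asymmetry is what lets the smaller separation radius keep boids from collapsing onto one point.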
+ +![Flocking Simulation with Grid Pruning](images/uniform_5000.gif) +*Boid Flocking Simulation* + + +## Section 1: Naive Solution +The boid flocking simulation is naively calculated by comparing the Euclidean distance from the current +boid to every other boid in the simulation, and checking whether the distance is within the desired range for +the rule being calculated (we use a smaller distance metric for calculating separation, otherwise the boids +never exhibit flocking behavior). + +While computationally this results in the correct behavior, it is wasteful in the number of comparison operations, +since we only apply position and velocity updates if a boid has at least one other particle close to it. For smaller +particle counts, this method can achieve 60 fps on a cheap modern processor, but it scales very poorly as the number of +comparisons increases. Detailed analysis is available in [Section 3: Performance Analysis](#part3). + + +## Section 2: Neighborhood Search +To improve on the performance of the naive solution, the physical volume can be discretized into a uniform grid, with each +voxel index linking to a list of boids contained inside that cell. This approach includes some additional computational +overhead to sort each boid into its correct bin, but it limits the search for neighboring boids to only those +contained in nearby cells. If the discretization is set such that the cell edge length is twice the effective distance +for velocity updates, the neighborhood search is limited to boids contained in at most eight cells. + +In particular, this approach has two major additional steps, both of which are easily unrolled into independent parallel +processes. The first step is to assign each boid to a cell, which is just a computation based on its +current position and the grid resolution, so overhead is minimal. + +The second major step is to sort all boids by their cell index and compute a list of all boids within each cell.
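The cell assignment in the first step amounts to quantizing each position by the cell width and flattening the 3D cell coordinate to a 1D index. A minimal sketch, with an x-fastest layout and illustrative parameter names (the project's grid minimum and resolution are computed from the scene scale):

```cpp
// Flatten a 3D cell coordinate to a 1D index (x varies fastest).
int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
    return x + y * gridResolution + z * gridResolution * gridResolution;
}

// Assign a position to its grid cell. gridMin* is the minimum corner of the
// simulation volume; cellWidth is the voxel edge length.
int cellIndexFor(float px, float py, float pz,
                 float gridMinX, float gridMinY, float gridMinZ,
                 float cellWidth, int gridResolution) {
    int cx = (int)((px - gridMinX) / cellWidth);
    int cy = (int)((py - gridMinY) / cellWidth);
    int cz = (int)((pz - gridMinZ) / cellWidth);
    return gridIndex3Dto1D(cx, cy, cz, gridResolution);
}
```

Because each boid's cell index depends only on its own position, this step parallelizes trivially with one thread per boid.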
CUDA's Thrust library +fortunately provides a very fast unstable sort operation. Once the boids are sorted by cell index, the boids belonging to each cell +can be identified in parallel solely by finding the transitions between cell indices in the sorted array. + +[Section 3: Performance Analysis](#part3) outlines the significant performance improvements over the naive solution. It also +shows a small improvement from adding an additional overhead step that sorts the position and velocity arrays directly, rather +than doing an array lookup for each update step. The idea behind this improvement is to remove the array lookup time required +when looping over all the boids in a cell within a single kernel instance, and to address the position/velocity vectors directly. +This extra step is referred to as the coherent array, while the unsorted position/velocity implementation will be called the +scattered array. + + +## Section 3: Performance Analysis +*All benchmarks were performed with a cohesion radius of 5 units, separation radius of 3 units, and alignment radius of 5 units. +Performance was tested relative to the number of boids in a cube with edge length 100 units.* + +Reducing the search space with a uniform grid shows significant improvements over the naive solution. Experiments show that +the naive implementation scales quadratically, as expected since its complexity is O(n^2). The uniform grid implementation shows much +better scaling as the number of boids increases, with the most significant improvement at 30,000 boids, the largest +count tested. The additional position/velocity sorting shows a 20% to 30% improvement in tests with significantly more boids +than cells. + +![Performance of Naive vs. Discrete Grid Implementations](images/performance.png) + + +## Section 4: Development Process +Development of the coherent grid surfaced several bugs. The first implementation attempt sorted +both the position and velocity arrays directly, with replacement on the original array.
This in-place sort produced an interesting artifact +where the general boid flocking behavior was mostly intact in two dimensions, but the boids seemed to align along the third +(shown below). + +![Buffer Overwrite Failure](images/uniform_5000_fail1.gif) +*Failure From Overwriting the Initial Position/Velocity Arrays* + +I originally determined that the cause was sorting the position and velocity frames prior to binning the boids +into their respective cells. Switching the order of these steps at first glance seemed to solve the problem. However, after +observing the behavior for longer time periods, all boids developed the same uniform velocity regardless of their +distance to neighboring flocks. + +It was finally determined that the root cause was that in the kernel invocation, the boid being tested must come from the +unsorted array, while the boids the current index is tested against must come from the sorted arrays. This final +implementation results in up to a 20% improvement depending on the density of the boids in the simulation. + + +## Appendix: Build Instructions + +* `src/` contains the source code. +* `external/` contains the binaries and headers for GLEW and GLFW. + +**CMake note:** Do not change any build settings or add any files to your +project directly (in Visual Studio, Nsight, etc.) Instead, edit the +`src/CMakeLists.txt` file. Any files you add must be added here. If you edit it, +just rebuild your VS/Nsight project to make it update itself. + +#### Windows + +1. In Git Bash, navigate to your cloned project directory. +2. Create a `build` directory: `mkdir build` + * (This "out-of-source" build makes it easy to delete the `build` directory + and try again if something goes wrong with the configuration.) +3. Navigate into that directory: `cd build` +4. Open the CMake GUI to configure the project: + * `cmake-gui ..` or `"C:\Program Files (x86)\cmake\bin\cmake-gui.exe" ..` + * Don't forget the `..` part!
+ * Make sure that the "Source" directory is like + `.../Project1-CUDA-Flocking`. + * Click *Configure*. Select your version of Visual Studio, Win64. + (**NOTE:** you must use Win64, as we don't provide libraries for Win32.) + * If you see an error like `CUDA_SDK_ROOT_DIR-NOTFOUND`, + set `CUDA_SDK_ROOT_DIR` to your CUDA install path. This will be something + like: `C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v7.5` + * Click *Generate*. +5. If generation was successful, there should now be a Visual Studio solution + (`.sln`) file in the `build` directory that you just created. Open this. + (from the command line: `explorer *.sln`) +6. Build. (Note that there are Debug and Release configuration options.) +7. Run. Make sure you run the `cis565_` target (not `ALL_BUILD`) by + right-clicking it and selecting "Set as StartUp Project". + * If you have switchable graphics (NVIDIA Optimus), you may need to force + your program to run with only the NVIDIA card. In NVIDIA Control Panel, + under "Manage 3D Settings," set "Multi-display/Mixed GPU acceleration" + to "Single display performance mode". + +#### OS X & Linux + +It is recommended that you use Nsight. + +1. Open Nsight. Set the workspace to the one *containing* your cloned repo. +2. *File->Import...->General->Existing Projects Into Workspace*. + * Select the Project 0 repository as the *root directory*. +3. Select the *cis565-* project in the Project Explorer. From the *Project* + menu, select *Build All*. + * For later use, note that you can select various Debug and Release build + configurations under *Project->Build Configurations->Set Active...*. +4. If you see an error like `CUDA_SDK_ROOT_DIR-NOTFOUND`: + * In a terminal, navigate to the build directory, then run: `cmake-gui ..` + * Set `CUDA_SDK_ROOT_DIR` to your CUDA install path. + This will be something like: `/usr/local/cuda` + * Click *Configure*, then *Generate*. +5. Right click and *Refresh* the project. +6. From the *Run* menu, *Run*. 
Select "Local C/C++ Application" and the + `cis565_` binary. diff --git a/images/coherent_5000.gif b/images/coherent_5000.gif new file mode 100644 index 0000000..4f5d9c2 Binary files /dev/null and b/images/coherent_5000.gif differ diff --git a/images/naive_5000.gif b/images/naive_5000.gif new file mode 100644 index 0000000..1b40585 Binary files /dev/null and b/images/naive_5000.gif differ diff --git a/images/performance.png b/images/performance.png new file mode 100644 index 0000000..6ace735 Binary files /dev/null and b/images/performance.png differ diff --git a/images/uniform_5000.gif b/images/uniform_5000.gif new file mode 100644 index 0000000..049ae12 Binary files /dev/null and b/images/uniform_5000.gif differ diff --git a/images/uniform_5000_fail1.gif b/images/uniform_5000_fail1.gif new file mode 100644 index 0000000..221a1e1 Binary files /dev/null and b/images/uniform_5000_fail1.gif differ diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt index fdd636d..dff0113 100644 --- a/src/CMakeLists.txt +++ b/src/CMakeLists.txt @@ -10,5 +10,5 @@ set(SOURCE_FILES cuda_add_library(src ${SOURCE_FILES} - OPTIONS -arch=sm_20 + OPTIONS -arch=sm_50 ) diff --git a/src/kernel.cu b/src/kernel.cu index 30356b9..fd4416a 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -5,6 +5,7 @@ #include #include "utilityCore.hpp" #include "kernel.h" +#include "device_launch_parameters.h" // LOOK-2.1 potentially useful for doing grid-based neighbor search #ifndef imax @@ -15,6 +16,14 @@ #define imin( a, b ) ( ((a) < (b)) ? (a) : (b) ) #endif +#ifndef clamp +#define clamp(x, lo, hi) (x < lo) ? lo : (x > hi) ? hi : x +#endif + +#ifndef wrap +#define wrap(x, lo, hi) (x < lo) ? x + (hi - lo) : (x > hi) ? x - (hi - lo) : x +#endif + #define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__) /** @@ -54,6 +63,8 @@ void checkCUDAError(const char *msg, int line = -1) { /*! Size of the starting area in simulation space. 
*/ #define scene_scale 100.0f +//#define gridFactor 2.0f + /*********************************************** * Kernel state (pointers are device pointers) * ***********************************************/ @@ -85,6 +96,8 @@ int *dev_gridCellEndIndices; // to this cell? // TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. +glm::vec3 *dev_pos_sorted; +glm::vec3 *dev_vel_sorted; // LOOK-2.1 - Grid parameters based on simulation parameters. // These are automatically computed for you in Boids::initSimulation @@ -147,17 +160,17 @@ void Boids::initSimulation(int N) { cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3)); checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!"); - + cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3)); checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!"); - + // LOOK-1.2 - This is a typical CUDA kernel invocation. kernGenerateRandomPosArray<<>>(1, numObjects, dev_pos, scene_scale); checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); // LOOK-2.1 computing grid params - gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); + gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); int halfSideCount = (int)(scene_scale / gridCellWidth) + 1; gridSideCount = 2 * halfSideCount; @@ -169,6 +182,24 @@ void Boids::initSimulation(int N) { gridMinimum.z -= halfGridWidth; // TODO-2.1 TODO-2.3 - Allocate additional buffers here. 
+ cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!"); + + cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!"); + + cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!"); + + cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!"); + + cudaMalloc((void**)&dev_pos_sorted, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_pos_sorted failed!"); + + cudaMalloc((void**)&dev_vel_sorted, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_vel_sorted failed!"); + cudaThreadSynchronize(); } @@ -223,6 +254,38 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) * stepSimulation * ******************/ + +/** +* Helper function to compute unscaled velocity update for a single pair of boids +*/ +__device__ int computeVelocityChangePair( + const glm::vec3 &pos1, const glm::vec3 &vel1, + const glm::vec3 &pos2, const glm::vec3 &vel2, + glm::vec3 &cen, glm::vec3 &sep, glm::vec3 &coh ) { + + int n = 0; + glm::vec3 dv; + + // RULE 1: Move to center of mass + float dist = glm::length(pos2 - pos1); + if (dist < rule1Distance) { + cen += pos2; + n++; + } + + // RULE 2: Maintain minimum distance from neighbors + if (dist < rule2Distance) { + sep -= (pos2 - pos1); + } + + // RULE 3: Align Velocities + if (dist < rule3Distance) { + coh += vel2; + } + + return n; +} + /** * LOOK-1.2 You can use this as a helper for kernUpdateVelocityBruteForce. * __device__ code can be called from a __global__ context @@ -230,10 +293,25 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) * in the `pos` and `vel` arrays. 
*/ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { - // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves - // Rule 2: boids try to stay a distance d away from each other - // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); + + glm::vec3 dv = glm::vec3(0.0); + glm::vec3 center = glm::vec3(0.0); + glm::vec3 separate = glm::vec3(0.0); + glm::vec3 cohesion = glm::vec3(0.0); + float nBoids = 0; + + for(int i = 0; i < N; i++) { + if (i != iSelf) { + nBoids += computeVelocityChangePair(pos[iSelf], vel[iSelf], pos[i], vel[i], center, separate, cohesion) ; + } + } + + if (nBoids > 0) { + center /= nBoids; + dv = (center - pos[iSelf]) * rule1Scale + cohesion * rule3Scale + separate * rule2Scale; + } + + return dv; } /** @@ -245,6 +323,18 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, // Compute a new velocity based on pos and vel1 // Clamp the speed // Record the new velocity into vel2. Question: why NOT vel1? + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + glm::vec3 dv = computeVelocityChange(N, index, pos, vel1); + glm::vec3 newVel = glm::vec3(0.0); + + newVel = vel2[index] + dv; + float speed = glm::length(newVel); + vel2[index] = (speed <= maxSpeed) ? newVel : newVel * maxSpeed / speed; } /** @@ -261,13 +351,14 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { thisPos += vel[index] * dt; // Wrap the boids around so we don't lose them - thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x; - thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y; - thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z; + float wrapScale = 1.0f; + thisPos.x = thisPos.x < -scene_scale ? scene_scale * wrapScale : thisPos.x; + thisPos.y = thisPos.y < -scene_scale ? 
scene_scale * wrapScale : thisPos.y; + thisPos.z = thisPos.z < -scene_scale ? scene_scale * wrapScale : thisPos.z; - thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x; - thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y; - thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z; + thisPos.x = thisPos.x > scene_scale ? -scene_scale * wrapScale : thisPos.x; + thisPos.y = thisPos.y > scene_scale ? -scene_scale * wrapScale : thisPos.y; + thisPos.z = thisPos.z > scene_scale ? -scene_scale * wrapScale : thisPos.z; pos[index] = thisPos; } @@ -285,10 +376,22 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { __global__ void kernComputeIndices(int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, glm::vec3 *pos, int *indices, int *gridIndices) { - // TODO-2.1 - // - Label each boid with the index of its grid cell. - // - Set up a parallel array of integer indices as pointers to the actual - // boid data in pos and vel1/vel2 + // TODO-2.1 + // - Label each boid with the index of its grid cell. + // - Set up a parallel array of integer indices as pointers to the actual + // boid data in pos and vel1/vel2 + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + // calculate the index from grid dimensions + glm::vec3 temp = (pos[index] - gridMin) * inverseCellWidth; + gridIndices[index] = gridIndex3Dto1D((int) temp.x, (int) temp.y, (int) temp.z, gridResolution); + + // for sorting a dictionary + indices[index] = index; } // LOOK-2.1 Consider how this could be useful for indicating that a cell @@ -306,8 +409,73 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, // Identify the start point of each cell in the gridIndices array. // This is basically a parallel unrolling of a loop that goes // "this index doesn't match the one before it, must be a new cell!" 
+ + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + int Cell = particleGridIndices[index]; + + if (index <= 0) { + gridCellStartIndices[Cell] = index; + + if (particleGridIndices[index + 1] != Cell) + gridCellEndIndices[Cell] = index; + } else if (index >= (N-1)) { + if (particleGridIndices[index - 1] != Cell) + gridCellStartIndices[Cell] = index; + + gridCellEndIndices[Cell] = index; + } else { + if (particleGridIndices[index - 1] != Cell) + gridCellStartIndices[Cell] = index; + + if (particleGridIndices[index + 1] != Cell) + gridCellEndIndices[Cell] = index; + } + +} + +/** +* Helper function to find neighboring grid cells to search +*/ +__device__ void computeNeighborList(int gridResolution, float cellWidth, glm::vec3 &pos, int *Neighbors ) { + + int gridCount = gridResolution*gridResolution*gridResolution; + + // x-axis (fmodf, not std::fmod: std:: math is not declared __device__) + if (fmodf(pos.x, cellWidth) < cellWidth/2) { + Neighbors[1] = wrap(Neighbors[0] - 1, 0, gridCount); + } else { + Neighbors[1] = wrap(Neighbors[0] + 1, 0, gridCount); + } + + // y-axis + if (fmodf(pos.y, cellWidth) < cellWidth/2) { + Neighbors[2] = wrap(Neighbors[0] - gridResolution, 0, gridCount); + Neighbors[3] = wrap(Neighbors[1] - gridResolution, 0, gridCount); + } else { + Neighbors[2] = wrap(Neighbors[0] + gridResolution, 0, gridCount); + Neighbors[3] = wrap(Neighbors[1] + gridResolution, 0, gridCount); + } + + // z-axis + if (fmodf(pos.z, cellWidth) < cellWidth/2) { + Neighbors[4] = wrap(Neighbors[0] - gridResolution*gridResolution, 0, gridCount); + Neighbors[5] = wrap(Neighbors[1] - gridResolution*gridResolution, 0, gridCount); + Neighbors[6] = wrap(Neighbors[2] - gridResolution*gridResolution, 0, gridCount); + Neighbors[7] = wrap(Neighbors[3] - gridResolution*gridResolution, 0, gridCount); + } else { + Neighbors[4] = wrap(Neighbors[0] + gridResolution*gridResolution, 0, gridCount); + Neighbors[5] = wrap(Neighbors[1] + gridResolution*gridResolution, 0, gridCount); + Neighbors[6] = 
wrap(Neighbors[2] + gridResolution*gridResolution, 0, gridCount); + Neighbors[7] = wrap(Neighbors[3] + gridResolution*gridResolution, 0, gridCount); + } + } + __global__ void kernUpdateVelNeighborSearchScattered( int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, float cellWidth, @@ -322,13 +490,52 @@ __global__ void kernUpdateVelNeighborSearchScattered( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. // - Clamp the speed change before putting the new speed in vel2 + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + // Current Cell Index + glm::vec3 temp = (pos[index] - gridMin) * inverseCellWidth; + int CurrentCell = gridIndex3Dto1D((int) temp.x, (int) temp.y, (int) temp.z, gridResolution); + + // Neighboring Cells + int Neighbors[8] = {CurrentCell}; + computeNeighborList(gridResolution, cellWidth, pos[index], Neighbors); + + // check all boids in neighboring cells + glm::vec3 dv = glm::vec3(0.0); + glm::vec3 center = glm::vec3(0.0); + glm::vec3 separate = glm::vec3(0.0); + glm::vec3 cohesion = glm::vec3(0.0); + float nBoids = 0.0; + + for (int i = 0; i < 8; i++) { + if (gridCellStartIndices[Neighbors[i]] != -1) { + for (int j = gridCellStartIndices[Neighbors[i]]; j <= gridCellEndIndices[Neighbors[i]]; j++) { + nBoids += computeVelocityChangePair(pos[index], vel1[index], pos[particleArrayIndices[j]], vel1[particleArrayIndices[j]], center, separate, cohesion); + } + } + } + + if (nBoids > 0) { + center /= nBoids; + dv = (center - pos[index]) * rule1Scale + cohesion * rule3Scale + separate * rule2Scale; + } + + // calculate new velocity and clamp + glm::vec3 newVel = vel2[index] + dv; + float speed = glm::length(newVel); + vel2[index] = (speed <= maxSpeed) ? 
newVel : newVel * maxSpeed / speed; } __global__ void kernUpdateVelNeighborSearchCoherent( int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, float cellWidth, int *gridCellStartIndices, int *gridCellEndIndices, - glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2, + glm::vec3 *pos_sort, glm::vec3 *vel_sort){ // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, // except with one less level of indirection. // This should expect gridCellStartIndices and gridCellEndIndices to refer @@ -341,6 +548,54 @@ __global__ void kernUpdateVelNeighborSearchCoherent( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. // - Clamp the speed change before putting the new speed in vel2 + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + // Current Cell Index + glm::vec3 temp = (pos[index] - gridMin) * inverseCellWidth; + int CurrentCell = gridIndex3Dto1D((int) temp.x, (int) temp.y, (int) temp.z, gridResolution); + + // Neighboring Cells + int Neighbors[8] = {CurrentCell}; + computeNeighborList(gridResolution, cellWidth, pos[index], Neighbors); + + // check all boids in neighboring cells + glm::vec3 dv = glm::vec3(0.0); + glm::vec3 center = glm::vec3(0.0); + glm::vec3 separate = glm::vec3(0.0); + glm::vec3 cohesion = glm::vec3(0.0); + float nBoids = 0.0; + + for (int i = 0; i < 8; i++) { + if (gridCellStartIndices[Neighbors[i]] != -1) { + for (int j = gridCellStartIndices[Neighbors[i]]; j <= gridCellEndIndices[Neighbors[i]]; j++) { + nBoids += computeVelocityChangePair(pos[index], vel1[index], pos_sort[j], vel_sort[j], center, separate, cohesion); + } + } + } + + if (nBoids > 0) { + center /= nBoids; + dv = (center - pos[index]) * rule1Scale + cohesion * rule3Scale + separate * rule2Scale; + } + + // calculate new velocity and clamp + glm::vec3 newVel = 
vel2[index] + dv; + float speed = glm::length(newVel); + vel2[index] = (speed <= maxSpeed) ? newVel : newVel * maxSpeed / speed; +} + +// Use this to pull velocity and position data out of the Wrapper container +__global__ void kernSortArray(int N, int *indices, glm::vec3 *vec1, glm::vec3 *vec2) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + vec2[index] = vec1[indices[index]]; } /** @@ -349,6 +604,17 @@ __global__ void kernUpdateVelNeighborSearchCoherent( void Boids::stepSimulationNaive(float dt) { // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. // TODO-1.2 ping-pong the velocity buffers + + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + + // Swap Velocity Buffers + glm::vec3 *temp = dev_vel1; + dev_vel1 = dev_vel2; + dev_vel2 = temp; + + // Update Velocities and Positions + kernUpdateVelocityBruteForce<<>>(numObjects, dev_pos, dev_vel1, dev_vel2); + kernUpdatePos<<>>(numObjects, dt, dev_pos, dev_vel2); } void Boids::stepSimulationScatteredGrid(float dt) { @@ -364,6 +630,32 @@ void Boids::stepSimulationScatteredGrid(float dt) { // - Perform velocity updates using neighbor search // - Update positions // - Ping-pong buffers as needed + + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + + // Swap Velocity Buffers + cudaMemcpy(dev_vel1, dev_vel2, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice); + + // Bin the Boids + kernComputeIndices<<>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + + // Sort + thrust::device_ptr dev_thrust_keys(dev_particleGridIndices); + thrust::device_ptr dev_thrust_values(dev_particleArrayIndices); + thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values); + + // Calculate boids in each cell + kernResetIntBuffer << > >(numObjects, dev_gridCellStartIndices, -1); + kernResetIntBuffer << > >(numObjects, 
dev_gridCellEndIndices, -1); + kernIdentifyCellStartEnd << > >(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + + // Update Velocities and Positions + kernUpdateVelNeighborSearchScattered<<>>( + numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, + dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, + dev_pos, dev_vel1, dev_vel2); + + kernUpdatePos<<>>(numObjects, dt, dev_pos, dev_vel2); } void Boids::stepSimulationCoherentGrid(float dt) { @@ -382,6 +674,36 @@ void Boids::stepSimulationCoherentGrid(float dt) { // - Perform velocity updates using neighbor search // - Update positions // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE. + + + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + + // Swap Velocity Buffers + cudaMemcpy(dev_vel1, dev_vel2, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice); + + // Bin the Boids + kernComputeIndices<<>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + + // Sort + thrust::device_ptr dev_thrust_keys(dev_particleGridIndices); + thrust::device_ptr dev_thrust_values(dev_particleArrayIndices); + thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values); + + // Reoder arrays + kernSortArray << > >(numObjects, dev_particleArrayIndices, dev_vel1, dev_vel_sorted); + kernSortArray << > >(numObjects, dev_particleArrayIndices, dev_pos, dev_pos_sorted); + + // Calculate boids in each cell + kernResetIntBuffer << > >(numObjects, dev_gridCellStartIndices, -1); + kernResetIntBuffer << > >(numObjects, dev_gridCellEndIndices, -1); + kernIdentifyCellStartEnd << > >(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + + // Update Velocities and Positions + kernUpdateVelNeighborSearchCoherent<<>>( + numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, + 
dev_gridCellStartIndices, dev_gridCellEndIndices, dev_pos, dev_vel1, dev_vel2, dev_pos_sorted, dev_vel_sorted); + + kernUpdatePos<<>>(numObjects, dt, dev_pos, dev_vel2); } void Boids::endSimulation() { @@ -390,6 +712,12 @@ void Boids::endSimulation() { cudaFree(dev_pos); // TODO-2.1 TODO-2.3 - Free any additional buffers here. + cudaFree(dev_particleArrayIndices); + cudaFree(dev_particleGridIndices); + cudaFree(dev_gridCellStartIndices); + cudaFree(dev_gridCellEndIndices); + cudaFree(dev_pos_sorted); + cudaFree(dev_vel_sorted); } void Boids::unitTest() { diff --git a/src/main.cpp b/src/main.cpp index e416836..ec1a3ea 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -14,8 +14,16 @@ // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID #define VISUALIZE 1 -#define UNIFORM_GRID 0 -#define COHERENT_GRID 0 +#define UNIFORM_GRID 1 +#define COHERENT_GRID 1 + +// fix the fact that VC++ 2010 is not C99 compliant +#if defined(_WIN32) || defined(_WIN64) +#define fmax max +#define fmin min +#pragma warning (disable:4996) +#define snprintf sprintf_s +#endif // LOOK-1.2 - change this to adjust particle count in the simulation const int N_FOR_VIS = 5000; @@ -220,21 +228,35 @@ void initShaders(GLuint * program) { double timebase = 0; int frame = 0; - Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure + float avgFrame = 0.0f; + int totalFrames = 0; + double timebase2 = 0; + + //Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure // your CUDA development setup is ready to go. 
while (!glfwWindowShouldClose(window)) { glfwPollEvents(); frame++; + totalFrames++; double time = glfwGetTime(); + // running average time per 100 frames + avgFrame = .99*avgFrame + .01*(time - timebase2); + timebase2 = time; + if (totalFrames % 100 == 0) { + printf("Average Execution Time: %f\n", avgFrame); + totalFrames = 0; + } + if (time - timebase > 1.0) { fps = frame / (time - timebase); - timebase = time; - frame = 0; + timebase = time; + frame = 0; } + runCUDA(); std::ostringstream ss;