diff --git a/README.md b/README.md index d63a6a1..ddaf1f5 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,88 @@ **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Zhenzhong Tang + * [LinkedIn](https://www.linkedin.com/in/zhenzhong-anthony-tang-82334a210), [Instagram](https://instagram.com/toytag12), [personal website](https://toytag.net/) +* Tested on: Windows 11 Pro 22H2, AMD EPYC 7V12 64-Core Processor (4 vCPU cores) @ 2.44GHz 28GiB, Tesla T4 16GiB (Azure) -### (TODO: Your README) +# Boids, A Flocking Simulation -Include screenshots, analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +![](images/demo.gif) + +This demo is a CUDA-based coherent uniform grid search implementation of the [Boids](https://en.wikipedia.org/wiki/Boids) flocking simulation algorithm. The algorithm, developed by Craig Reynolds in 1986, simulates the flocking behaviour of birds. The goal of this project is to implement the algorithm with CUDA in different ways and optimize performance. + + +### Naive Implementation + +A simple implementation of the algorithm is to iterate through each boid and calculate the updated velocity and position based on [three rules](http://www.vergenet.net/~conrad/boids/pseudocode.html). Even with CUDA parallelization, this method is slow because each boid needs to check the position of every other boid. There are $O(n^2)$ global memory access, causing the CUDA program to be heavily memory-bounded. + + +### Uniform Grid Search (Scattered Memory) + +To avoid checking every other boids in the simulation, we can use a uniform grid to divide the space into cells. Each boid is assigned to a cell based on its position. Then, we only need to check the boids in the same cell and its neighboring cells. + +![](images/Boids%20Ugrid%20neighbor%20search%20shown.png) + +The image shows a simplified version in 2d plane, with grid cell width equals to 2 times the maximum search radius. Each boids would only need to check at most 4 neighbor cells (or in 3d case, 8 neighbor cells). This significantly reduces the number of global memory access. + + +### Uniform Grid Search (Coherent Memory) + +We can store the sorted position and velocity of boids based on cell indexes, so that boids data are contiguous in memory from the perspective of grid cells. The specific implementation eliminates an overhead associated with fetching a pointer index from global memory to access the position and velocity of boids in global memory. Interestingly, one parallelizing sort is considerably faster than the cost of random fetches of global memory. Also, ***coalesced*** memory access benefits more from cache and compiler optimizations. + +![](images/Coherent%20Boids%20in%20Mem.png) + +## Performance Analysis + +We use average Frames Per Second (FPS) to compare the performance of different implementations. Each kernel function call has a block size of 128, which in most cases leads to better performance than other block sizes. + +### Number of Boids + +![](images/FPS%20with%20Visualization%20ON.svg) +![](images/FPS%20with%20Visualization%20OFF.svg) + +We see the performance of naive implementation is worse than grid searches. The coherent grid search implementation is the fastest, with a 3x speedup compared to the naive implementation with 5,000 boids. With more boids, we see more significant speedup. The coherent grid search is even 10x faster than the scattered grid search in 500,000 and 1,000,000 boids tests. + +Though with visualization on, the peek performance of the coherent grid search is lower than turning off the visualization. It could be argued that the visualization is the bottleneck, but the performance of the coherent grid search is still better than others. + +### Block Size + +![](images/FPS%20with%205,000%20Boids%20and%20Visualization%20OFF.svg) + +Selecting an appropriate block size is crucial to effectively utilize the available resources on the GPU, ensuring that all Streaming Multiprocessors (SMs) are actively engaged, and there are enough warps to hide memory access latencies. But in our tests, we didn't see significant difference in performance with different block sizes. All there implementation tend to perform a bit worse when block size is too large, like 1024, or too small, like 16. + +### Fine-Grained Cells (8 v.s. 27 Neighbors) + +![](images/FPS%20with%20Visualization%20OFF,%208%20Grids%20v.s.%2027%20Grids%20Search.svg) + +We see even more improved results when we set the grid cell width to 1 times the maximum search radius, so that each boid would need to check at most 27 neighbor cells. With fine-grained search, the coherent grid search is more than 5x faster than the 8-cell coherent grid search in 500,000 and 1,000,000 boids tests. This is huge improvement. To put it more intuitively, see the stacked comparison. 27-cell coherent grid search in 500,000 and 1,000,000 tests is dominating the others. + +![](images/FPS%20with%20Visualization%20OFF,%208%20Grids%20v.s.%2027%20Grids%20Search,%20Stacked.svg) + +### Profile of Uniform Grid Search (Scattered Memory) and Uniform Grid Search (Coherent Memory) + +Current(blue) is the coherent grid search, and baseline(green) is the scattered grid search. +![](images/scattered-coherent.png) + +### Profile of 27-cell Coherent Grid Search and 8-cell Coherent Grid Search + +Current(blue) is the 27-cell coherent grid search, and baseline(green) is the 8-cell coherent grid search. +![](images/coherent-8-27.png) + +### QA + +#### For each implementation, how does changing the number of boids affect performance? Why do you think this is? + +With more boids, all three implementations perform worse. This is because with more boids, there are more things to sort, more memory access, more thread/warp/block management, and more computation. + +#### For each implementation, how does changing the block count and block size affect performance? Why do you think this is? + +Changing the block size has a relatively minor impact on the results, with only slight decreases in performance observed when block size is 16 (too small) or 1024 (too large). This limited effect can be attributed to the fact that the fundamental thread scheduling is managed by a "warp" of 32 threads. + +#### For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not? + +The coherent grid search is faster than the scattered grid search, especially when the number of boids is huge. I didn't expect to see such big leap. I think this is because the coherent grid search has better memory access pattern, and the overhead of sorting is less than the overhead of global memory access. + +#### Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? + +Yes, using 27 neighbor cells improved the performance by quite a lot. On one hand there are more cells to search, but on the other hand, the search is more accurate, so that we can avoid checking boids that are not in distance from the current boid, which results in less global memory access. This is especially useful when the number of boids is large. All in all, our program is still **memory-bounded**. diff --git a/images/Coherent Boids in Mem.png b/images/Coherent Boids in Mem.png new file mode 100644 index 0000000..70b9c16 Binary files /dev/null and b/images/Coherent Boids in Mem.png differ diff --git a/images/FPS with 5,000 Boids and Visualization OFF.svg b/images/FPS with 5,000 Boids and Visualization OFF.svg new file mode 100644 index 0000000..8edc3bf --- /dev/null +++ b/images/FPS with 5,000 Boids and Visualization OFF.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/images/FPS with Visualization OFF, 8 Grids v.s. 27 Grids Search, Stacked.svg b/images/FPS with Visualization OFF, 8 Grids v.s. 27 Grids Search, Stacked.svg new file mode 100644 index 0000000..46f8062 --- /dev/null +++ b/images/FPS with Visualization OFF, 8 Grids v.s. 27 Grids Search, Stacked.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/images/FPS with Visualization OFF, 8 Grids v.s. 27 Grids Search.svg b/images/FPS with Visualization OFF, 8 Grids v.s. 27 Grids Search.svg new file mode 100644 index 0000000..fa5562d --- /dev/null +++ b/images/FPS with Visualization OFF, 8 Grids v.s. 27 Grids Search.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/images/FPS with Visualization OFF.svg b/images/FPS with Visualization OFF.svg new file mode 100644 index 0000000..bc555a3 --- /dev/null +++ b/images/FPS with Visualization OFF.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/images/FPS with Visualization ON.svg b/images/FPS with Visualization ON.svg new file mode 100644 index 0000000..2c05188 --- /dev/null +++ b/images/FPS with Visualization ON.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/images/coherent-8-27.png b/images/coherent-8-27.png new file mode 100644 index 0000000..f6b977d Binary files /dev/null and b/images/coherent-8-27.png differ diff --git a/images/demo.gif b/images/demo.gif new file mode 100644 index 0000000..0cd9bfa Binary files /dev/null and b/images/demo.gif differ diff --git a/images/scattered-8-27.png b/images/scattered-8-27.png new file mode 100644 index 0000000..1db8dfe Binary files /dev/null and b/images/scattered-8-27.png differ diff --git a/images/scattered-coherent.png b/images/scattered-coherent.png new file mode 100644 index 0000000..e734cb1 Binary files /dev/null and b/images/scattered-coherent.png differ diff --git a/src/kernel.cu b/src/kernel.cu index 74dffcb..f3f7f1c 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -36,6 +36,9 @@ void checkCUDAError(const char *msg, int line = -1) { * Configuration * *****************/ +// Search 27 nearby cells instead of 8, uncomment to enable. +//#define SEARCH27 + /*! Block size used for CUDA kernel launch. */ #define blockSize 128 @@ -85,6 +88,8 @@ int *dev_gridCellEndIndices; // to this cell? // TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. +glm::vec3 *dev_coherentPos; +glm::vec3 *dev_coherentVel; // LOOK-2.1 - Grid parameters based on simulation parameters. // These are automatically computed for you in Boids::initSimulation @@ -157,7 +162,11 @@ void Boids::initSimulation(int N) { checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); // LOOK-2.1 computing grid params +#ifdef SEARCH27 + gridCellWidth = 1.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); +#else gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); +#endif int halfSideCount = (int)(scene_scale / gridCellWidth) + 1; gridSideCount = 2 * halfSideCount; @@ -169,6 +178,27 @@ void Boids::initSimulation(int N) { gridMinimum.z -= halfGridWidth; // TODO-2.1 TODO-2.3 - Allocate additional buffers here. + cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!"); + + cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!"); + + dev_thrust_particleArrayIndices = thrust::device_pointer_cast(dev_particleArrayIndices); + dev_thrust_particleGridIndices = thrust::device_pointer_cast(dev_particleGridIndices); + + cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!"); + + cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!"); + + cudaMalloc((void**)&dev_coherentPos, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_coherentPos failed!"); + + cudaMalloc((void**)&dev_coherentVel, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_coherentVel failed!"); + cudaDeviceSynchronize(); } @@ -229,11 +259,37 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) * Compute the new velocity on the body with index `iSelf` due to the `N` boids * in the `pos` and `vel` arrays. */ -__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { - // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves - // Rule 2: boids try to stay a distance d away from each other - // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); +__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3* pos, const glm::vec3* vel) { + int numNeighbors1 = 0, numNeighbors3 = 0; + glm::vec3 pc{ 0.0f }, c{ 0.0f }, pv{ 0.0f }; + for (int i = 0; i < N; i++) { + if (i != iSelf) { + float distance = glm::distance(pos[i], pos[iSelf]); + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (distance < rule1Distance) { + pc += pos[i]; + numNeighbors1++; + } + // Rule 2: boids try to stay a distance d away from each other + if (distance < rule2Distance) c -= (pos[i] - pos[iSelf]); + // Rule 3: boids try to match the speed of surrounding boids + if (distance < rule3Distance) { + pv += vel[i]; + numNeighbors3++; + } + } + } + glm::vec3 velDelta{ 0.0f }; + if (numNeighbors1 != 0) { + pc /= numNeighbors1; + velDelta += (pc - pos[iSelf]) * rule1Scale; + } + velDelta += c * rule2Scale; + if (numNeighbors3 != 0) { + pv /= numNeighbors3; + velDelta += pv * rule3Scale; + } + return velDelta; } /** @@ -243,8 +299,18 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { // Compute a new velocity based on pos and vel1 + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + glm::vec3 newVel = vel1[index] + computeVelocityChange(N, index, pos, vel1); // Clamp the speed + if (glm::length(newVel) > maxSpeed) { + newVel = glm::normalize(newVel) * maxSpeed; + } // Record the new velocity into vel2. Question: why NOT vel1? + // Answer: other threads may be still reading from vel1 to compute velocity changes + vel2[index] = newVel; } /** @@ -285,10 +351,17 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { __global__ void kernComputeIndices(int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, glm::vec3 *pos, int *indices, int *gridIndices) { - // TODO-2.1 - // - Label each boid with the index of its grid cell. - // - Set up a parallel array of integer indices as pointers to the actual - // boid data in pos and vel1/vel2 + // TODO-2.1 + // - Label each boid with the index of its grid cell. + // - Set up a parallel array of integer indices as pointers to the actual + // boid data in pos and vel1/vel2 + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + indices[index] = index; + glm::ivec3 gridIndex = glm::floor((pos[index] - gridMin) * inverseCellWidth); + gridIndices[index] = gridIndex3Dto1D(gridIndex.x, gridIndex.y, gridIndex.z, gridResolution); } // LOOK-2.1 Consider how this could be useful for indicating that a cell @@ -306,6 +379,24 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, // Identify the start point of each cell in the gridIndices array. // This is basically a parallel unrolling of a loop that goes // "this index doesn't match the one before it, must be a new cell!" + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + int currentGrid = particleGridIndices[index]; + if (index == 0) { + gridCellStartIndices[currentGrid] = index; + } + else if (index < N) { + int previousGrid = particleGridIndices[index - 1]; + if (currentGrid != previousGrid) { + gridCellEndIndices[previousGrid] = index - 1; + gridCellStartIndices[currentGrid] = index; + } + if (index == N - 1) { + gridCellEndIndices[currentGrid] = index; + } + } + else { + return; + } } __global__ void kernUpdateVelNeighborSearchScattered( @@ -316,12 +407,77 @@ __global__ void kernUpdateVelNeighborSearchScattered( glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce // the number of boids that need to be checked. + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } // - Identify the grid cell that this particle is in // - Identify which cells may contain neighbors. This isn't always 8. + glm::vec3 thisPos = pos[index], thisVel = vel1[index]; + float r = imax(imax(rule1Distance, rule2Distance), rule3Distance); + glm::ivec3 gridIndexStart3d = glm::floor((thisPos - r - gridMin) * inverseCellWidth); + glm::ivec3 gridIndexEnd3d = glm::floor((thisPos + r - gridMin) * inverseCellWidth); // - For each cell, read the start/end indices in the boid pointer array. // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. + int numNeighbors1 = 0, numNeighbors3 = 0; + glm::vec3 pc{ 0.0f }, c{ 0.0f }, pv{ 0.0f }; + for (int x = gridIndexStart3d.x; x <= gridIndexEnd3d.x; x++) { + for (int y = gridIndexStart3d.y; y <= gridIndexEnd3d.y; y++) { + for (int z = gridIndexStart3d.z; z <= gridIndexEnd3d.z; z++) { + int gridIndex1d = gridIndex3Dto1D(x, y, z, gridResolution); + int startIndex = gridCellStartIndices[gridIndex1d], endIndex = gridCellEndIndices[gridIndex1d]; + if (startIndex == -1 || endIndex == -1) { + continue; + } + for (int i = startIndex; i <= endIndex; i++) { + int boidIndex = particleArrayIndices[i]; + if (boidIndex == index) { + continue; + } + glm::vec3 otherPos = pos[boidIndex], otherVel = vel1[boidIndex]; + float distance = glm::distance(otherPos, thisPos); + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (distance < rule1Distance) { + pc += otherPos; + numNeighbors1++; + } + // Rule 2: boids try to stay a distance d away from each other + if (distance < rule2Distance) c -= (otherPos - thisPos); + // Rule 3: boids try to match the speed of surrounding boids + if (distance < rule3Distance) { + pv += otherVel; + numNeighbors3++; + } + } + } + } + } + glm::vec3 newVel{ thisVel }; + if (numNeighbors1 != 0) { + pc /= numNeighbors1; + newVel += (pc - thisPos) * rule1Scale; + } + newVel += c * rule2Scale; + if (numNeighbors3 != 0) { + pv /= numNeighbors3; + newVel += pv * rule3Scale; + } // - Clamp the speed change before putting the new speed in vel2 + if (glm::length(newVel) > maxSpeed) { + newVel = glm::normalize(newVel) * maxSpeed; + } + vel2[index] = newVel; +} + +__global__ void kernRearrangePosVel(int N, int *particleArrayIndices, glm::vec3 *pos, glm::vec3 *vel, glm::vec3 *coherentPos, glm::vec3 *coherentVel) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + int arrIndex = particleArrayIndices[index]; + coherentPos[index] = pos[arrIndex]; + coherentVel[index] = vel[arrIndex]; } __global__ void kernUpdateVelNeighborSearchCoherent( @@ -333,14 +489,68 @@ __global__ void kernUpdateVelNeighborSearchCoherent( // except with one less level of indirection. // This should expect gridCellStartIndices and gridCellEndIndices to refer // directly to pos and vel1. + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } // - Identify the grid cell that this particle is in // - Identify which cells may contain neighbors. This isn't always 8. + glm::vec3 thisPos = pos[index], thisVel = vel1[index]; + float r = imax(imax(rule1Distance, rule2Distance), rule3Distance); + glm::ivec3 gridIndexStart3d = glm::floor((thisPos - r - gridMin) * inverseCellWidth); + glm::ivec3 gridIndexEnd3d = glm::floor((thisPos + r - gridMin) * inverseCellWidth); // - For each cell, read the start/end indices in the boid pointer array. // DIFFERENCE: For best results, consider what order the cells should be // checked in to maximize the memory benefits of reordering the boids data. // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. + int numNeighbors1 = 0, numNeighbors3 = 0; + glm::vec3 pc{ 0.0f }, c{ 0.0f }, pv{ 0.0f }; + for (int z = gridIndexStart3d.z; z <= gridIndexEnd3d.z; z++) { + for (int y = gridIndexStart3d.y; y <= gridIndexEnd3d.y; y++) { + for (int x = gridIndexStart3d.x; x <= gridIndexEnd3d.x; x++) { + int gridIndex1d = gridIndex3Dto1D(x, y, z, gridResolution); + int startIndex = gridCellStartIndices[gridIndex1d], endIndex = gridCellEndIndices[gridIndex1d]; + if (startIndex == -1 || endIndex == -1) { + continue; + } + for (int boidIndex = startIndex; boidIndex <= endIndex; boidIndex++) { + if (boidIndex == index) { + continue; + } + glm::vec3 otherPos = pos[boidIndex], otherVel = vel1[boidIndex]; + float distance = glm::distance(otherPos, thisPos); + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (distance < rule1Distance) { + pc += otherPos; + numNeighbors1++; + } + // Rule 2: boids try to stay a distance d away from each other + if (distance < rule2Distance) c -= (otherPos - thisPos); + // Rule 3: boids try to match the speed of surrounding boids + if (distance < rule3Distance) { + pv += otherVel; + numNeighbors3++; + } + } + } + } + } + glm::vec3 newVel{ thisVel }; + if (numNeighbors1 != 0) { + pc /= numNeighbors1; + newVel += (pc - thisPos) * rule1Scale; + } + newVel += c * rule2Scale; + if (numNeighbors3 != 0) { + pv /= numNeighbors3; + newVel += pv * rule3Scale; + } // - Clamp the speed change before putting the new speed in vel2 + if (glm::length(newVel) > maxSpeed) { + newVel = glm::normalize(newVel) * maxSpeed; + } + vel2[index] = newVel; } /** @@ -348,40 +558,66 @@ __global__ void kernUpdateVelNeighborSearchCoherent( */ void Boids::stepSimulationNaive(float dt) { // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + kernUpdateVelocityBruteForce << > > (numObjects, dev_pos, dev_vel1, dev_vel2); + kernUpdatePos << > > (numObjects, dt, dev_pos, dev_vel2); // TODO-1.2 ping-pong the velocity buffers + std::swap(dev_vel1, dev_vel2); } void Boids::stepSimulationScatteredGrid(float dt) { // TODO-2.1 // Uniform Grid Neighbor search using Thrust sort. // In Parallel: + dim3 fullBlocksPerGridForObjects((numObjects + blockSize - 1) / blockSize); + dim3 fullBlocksPerGridForGridCells((gridCellCount + blockSize - 1) / blockSize); + kernResetIntBuffer << > > (numObjects, dev_gridCellStartIndices, -1); + kernResetIntBuffer << > > (numObjects, dev_gridCellEndIndices, -1); // - label each particle with its array index as well as its grid index. // Use 2x width grids. + kernComputeIndices << > > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices); // - Unstable key sort using Thrust. A stable sort isn't necessary, but you // are welcome to do a performance comparison. + thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices); // - Naively unroll the loop for finding the start and end indices of each // cell's data pointers in the array of boid indices + kernIdentifyCellStartEnd << > > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); // - Perform velocity updates using neighbor search + kernUpdateVelNeighborSearchScattered << > > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2); // - Update positions + kernUpdatePos << > > (numObjects, dt, dev_pos, dev_vel2); // - Ping-pong buffers as needed + std::swap(dev_vel1, dev_vel2); } void Boids::stepSimulationCoherentGrid(float dt) { // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid // Uniform Grid Neighbor search using Thrust sort on cell-coherent data. // In Parallel: + dim3 fullBlocksPerGridForObjects((numObjects + blockSize - 1) / blockSize); + dim3 fullBlocksPerGridForGridCells((gridCellCount + blockSize - 1) / blockSize); + kernResetIntBuffer << > > (numObjects, dev_gridCellStartIndices, -1); + kernResetIntBuffer << > > (numObjects, dev_gridCellEndIndices, -1); // - Label each particle with its array index as well as its grid index. // Use 2x width grids + kernComputeIndices << > > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices); // - Unstable key sort using Thrust. A stable sort isn't necessary, but you // are welcome to do a performance comparison. + thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices); // - Naively unroll the loop for finding the start and end indices of each // cell's data pointers in the array of boid indices + kernIdentifyCellStartEnd << > > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all // the particle data in the simulation array. - // CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED + kernRearrangePosVel << > > (numObjects, dev_particleArrayIndices, dev_pos, dev_vel1, dev_coherentPos, dev_coherentVel); // - Perform velocity updates using neighbor search + // CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED + kernUpdateVelNeighborSearchCoherent << > > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_coherentPos, dev_coherentVel, dev_vel2); // - Update positions + kernUpdatePos << > > (numObjects, dt, dev_coherentPos, dev_vel2); // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE. + std::swap(dev_vel1, dev_vel2); + std::swap(dev_pos, dev_coherentPos); } void Boids::endSimulation() { @@ -390,6 +626,10 @@ void Boids::endSimulation() { cudaFree(dev_pos); // TODO-2.1 TODO-2.3 - Free any additional buffers here. + cudaFree(dev_particleArrayIndices); + cudaFree(dev_particleGridIndices); + cudaFree(dev_gridCellStartIndices); + cudaFree(dev_gridCellEndIndices); } void Boids::unitTest() { diff --git a/src/main.cpp b/src/main.cpp index b82c8c6..d18b477 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -183,7 +183,7 @@ void initShaders(GLuint * program) { //==================================== // Main loop //==================================== - void runCUDA() { + void runCUDA(cudaEvent_t &start, cudaEvent_t &stop, float &elapsedTime) { // Map OpenGL buffer object for writing from CUDA on a single GPU // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not // use this buffer @@ -196,6 +196,7 @@ void initShaders(GLuint * program) { cudaGLMapBufferObject((void**)&dptrVertVelocities, boidVBO_velocities); // execute the kernel + cudaEventRecord(start, 0); #if UNIFORM_GRID && COHERENT_GRID Boids::stepSimulationCoherentGrid(DT); #elif UNIFORM_GRID @@ -203,6 +204,9 @@ void initShaders(GLuint * program) { #else Boids::stepSimulationNaive(DT); #endif + cudaEventRecord(stop, 0); + cudaEventSynchronize(stop); + cudaEventElapsedTime(&elapsedTime, start, stop); #if VISUALIZE Boids::copyBoidsToVBO(dptrVertPositions, dptrVertVelocities); @@ -216,6 +220,10 @@ void initShaders(GLuint * program) { double fps = 0; double timebase = 0; int frame = 0; + cudaEvent_t start, stop; + float elapsedTime = 0; + cudaEventCreate(&start); + cudaEventCreate(&stop); Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure // your CUDA development setup is ready to go. @@ -232,13 +240,17 @@ void initShaders(GLuint * program) { frame = 0; } - runCUDA(); + runCUDA(start, stop, elapsedTime); std::ostringstream ss; ss << "["; ss.precision(1); ss << std::fixed << fps; - ss << " fps] " << deviceName; + ss << " fps] "; + ss.precision(3); + ss << "[" << elapsedTime << " ms] "; + ss << deviceName; + ss << " (" << N_FOR_VIS << " Bodies)"; glfwSetWindowTitle(window, ss.str().c_str()); glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); @@ -258,6 +270,9 @@ void initShaders(GLuint * program) { } glfwDestroyWindow(window); glfwTerminate(); + + cudaEventDestroy(start); + cudaEventDestroy(stop); } @@ -290,8 +305,8 @@ void initShaders(GLuint * program) { updateCamera(); } - lastX = xpos; - lastY = ypos; + lastX = xpos; + lastY = ypos; } void updateCamera() { diff --git a/src/main.hpp b/src/main.hpp index 88e9df7..dc21070 100644 --- a/src/main.hpp +++ b/src/main.hpp @@ -70,7 +70,7 @@ void keyCallback(GLFWwindow* window, int key, int scancode, int action, int mods void mouseButtonCallback(GLFWwindow* window, int button, int action, int mods); void mousePositionCallback(GLFWwindow* window, double xpos, double ypos); void updateCamera(); -void runCUDA(); +void runCUDA(cudaEvent_t &start, cudaEvent_t &stop, float &elapsedTime); //==================================== // Setup/init Stuff