GPU Traversal of KdTree for N-body problems
This is my senior year undergrad research project. It is an implementation of this paper:
https://engineering.purdue.edu/~milind/docs/sc13.pdf
Basically it uses autorope and lockstep traversal to improve the performance for kdtree traversal on GPU
1.Reference
The kdtree construction method and CPU traversal of kdTree is downloaded from http://nghiaho.com/?p=437 for comparation.
The correctness of the original code is tested by the ANN library. So I assume the result is correct. Then I removed the ANN library and add my autorope traversal method and compare the result, and the result is correct. Since it uses same dataset every time, the result of my implementation is also correct. Then I delete the original traversal and computation function and added my lockstep traversal method, and compare it with the result from autorope. The results are also same.
2.Files and Functions
Main.cu: handle input files and points sizes, call kdtree build function, traversal functions, calculate time for each method(lockstep and autorope).
CUDA_KDtree.cu: Lockstep and autorope traversal for KNN. Kernal launch functions, memory copy function, kdtree transfer to gpu function, distance computation function. Local stack, can only support for up to 50000 data points, but no limit for data points as long as the GPU global memory is enough.
SearchAtNode function: the main traversal function Search function: extra function, will be deleted Search Batch: Kernal function of traversal CUDA_KDTree::CreateKDTree: copy the content of the KDtree created by main function, and then memcpy that to GPU pointer CUDA_KDTree::Search: handle memory allocate, kernal launch, memory copy and transfer back, and free.
CUDA_KDtree1.cu: Lockstep and autorope traversal for KNN. Stack with global memory, significantly slow, but can support large number of data points. Functions similar with CUDA_KDtree.cu
KDtree.cu: the kdtree layout and build functions on CPU.
Timers.cpp: the timer function used
KdTree traversal implementation based on Michael Goldfarb's paper: General Transformations for GPU Execution of Tree Traversals
The result is from 50k points, 100k query points for 5 nearest neighbor.
Result of local stack
Result of Global memory stack
Tested Result for Autorope and Lockstep Traversal(KNN)
The input data is from dataset directory
Tested code is in GPU_tests folder(compile with CUDA_KDTree_KNN)
Result Compare with CPU(NN using ANN library)
Tested code is in NN_Compare folder
Graph of compute time(Autorope, Lockstep, ANN)
Tested on G2-D3-1000000.csv
Compare results for real_dataset: 5 nearest neighbor
file:IHEPC.csv
N: 2075259
autorope: 12.7538 s
lockstep: 13.4201 s
kNN_kdtree: 114.39 s