This was a challenge project at NCTU (National Chiao Tung University): use the CUDA parallel computation framework to speed up the computation of one ConvNet layer. Whichever team achieved the maximum GPU-over-CPU speedup won. This code took first place in the first round, fourth place in the second round, and first place overall.
Each team was provided with a server with an NVIDIA GTX 680 GPU on board. The same one. Yes, every team was provided with the same server and the same GPU. Simultaneously. Feel the pain.
Methods used to achieve maximum speedup included sparse arrays, shared GPU memory, and loop unrolling. Loop unrolling alone gave about a 0.5 ms boost, which secured first place in the first round. Another trick was to switch the compiler architecture from the default (compute_10) to a better one (compute_30): global memory accesses are cached on compute_30 but not on compute_10, which yields a noticeable speedup.
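For illustration, the kernel-level tricks above look roughly like the sketch below. All names (`convRow`, `filt`, `inNeu`, `FILT_SIZE`) are hypothetical, not the names used in the actual submission; the architecture switch corresponds to compiling with something like `nvcc -arch=sm_30` instead of the old default.

```cpp
// Hypothetical fragment illustrating shared memory + loop unrolling;
// built with something like: nvcc -arch=sm_30 conv.cu
#define FILT_SIZE 3

__global__ void convRow(const float *filt, const float *inNeu,
                        float *out, int width)
{
    __shared__ float sFilt[FILT_SIZE];   // filter cached in fast shared memory

    if (threadIdx.x < FILT_SIZE)         // one cooperative load per block
        sFilt[threadIdx.x] = filt[threadIdx.x];
    __syncthreads();

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x + FILT_SIZE <= width) {
        float sum = 0.0f;
        #pragma unroll                   // fixed trip count -> fully unrolled
        for (int k = 0; k < FILT_SIZE; k++)
            sum += sFilt[k] * inNeu[x + k];
        out[x] = sum;
    }
}
```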
The full report is available inside this repository as well.
Original Task
Three sub-directories
./data
./innerProduct
./device
Usage of the base program
Task
Evaluation
Rules
Useful references
Part-I: Use CUDA to accelerate the operations of a typical convolutional layer in commonly used large-scale neural networks. (You can find the description slides here)
Part-II: Accelerate a sparse convolutional layer with CUDA. (You can find the description slides here)
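For reference, the core of the layer in Part-I is a deeply nested dot product. Below is a minimal CPU sketch of a dense, stride-1, valid-padding convolution; all names and the memory layout are assumptions for illustration, not the assignment's actual interface.

```cpp
// Hypothetical dense convolution reference, valid padding, stride 1.
// outNeu[f][y][x] = sum over c,i,j of filt[f][c][i][j] * inNeu[c][y+i][x+j]
void convLayer(const float *filt, const float *inNeu, float *outNeu,
               int F, int C, int H, int W, int K)
{
    int outH = H - K + 1, outW = W - K + 1;
    for (int f = 0; f < F; f++)                     // output feature maps
        for (int y = 0; y < outH; y++)
            for (int x = 0; x < outW; x++) {
                float sum = 0.0f;
                for (int c = 0; c < C; c++)         // input channels
                    for (int i = 0; i < K; i++)     // filter window
                        for (int j = 0; j < K; j++)
                            sum += filt[((f*C + c)*K + i)*K + j]
                                 * inNeu[(c*H + y + i)*W + (x + j)];
                outNeu[(f*outH + y)*outW + x] = sum;
            }
}
```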
This directory contains the input data for the base program
- /data/filt.txt - Stores the values of the filters
- /data/filt.coo - Stores the values of the filters in COO format
- /data/inNeu.txt - Stores the values of the input neurons
- /data/inNeu.coo - Stores the values of the input neurons in COO format
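The exact file layout is defined by the base program. As a hedged illustration, assuming each line of a `.coo` file holds a whitespace-separated `row col value` triple, a reader could look like this:

```cpp
// Hypothetical COO reader: assumes one whitespace-separated
// "row col value" triple per line (actual layout is set by the base program).
#include <cstdio>
#include <cstdlib>

struct CooEntry { int row, col; float val; };

int readCoo(const char *path, CooEntry **out, int *n)
{
    FILE *fp = fopen(path, "r");
    if (!fp) return -1;

    int cap = 1024, cnt = 0;
    CooEntry *e = (CooEntry *)malloc(cap * sizeof(CooEntry));
    while (fscanf(fp, "%d %d %f", &e[cnt].row, &e[cnt].col, &e[cnt].val) == 3) {
        if (++cnt == cap)                 // grow the buffer as entries arrive
            e = (CooEntry *)realloc(e, (cap *= 2) * sizeof(CooEntry));
    }
    fclose(fp);
    *out = e;
    *n = cnt;
    return 0;
}
```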
This example shows how to use CUDA to accelerate an inner product:
cd ./innerProduct
make
make run
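The pattern that example demonstrates is a block-level parallel reduction. A self-contained sketch of that pattern (not the course's actual code; assumes 256 threads per block, a power of two):

```cpp
// Hypothetical inner-product kernel: each block reduces its partial sums
// in shared memory; per-block results are combined afterwards on the host.
__global__ void dotKernel(const float *a, const float *b,
                          float *partial, int n)
{
    __shared__ float cache[256];          // one slot per thread in the block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float sum = 0.0f;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x)
        sum += a[i] * b[i];               // grid-stride accumulation

    cache[threadIdx.x] = sum;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];   // one partial result per block
}
```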
The program under this directory shows the device information:
cd ./device
make
make run
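In spirit, such a device-query program boils down to the CUDA runtime calls below (a minimal sketch, not the actual ./device source):

```cpp
// Minimal device query, similar in spirit to the ./device example.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute capability %d.%d, %zu MB global memory\n",
               d, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem >> 20);
    }
    return 0;
}
```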
git clone https://github.com/OwlSoul/ConvLayer_CUDA.git
make
make run
- Put the input data in sparse format and reimplement your CUDA kernels
- Use NVIDIA Visual Profiler to analyze and improve your code
- Optimize your CUDA kernels for the sparse format
- Improve the input data format (e.g., use another sparse format rather than COO; see the CSR sketch after this list)
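One common step beyond COO is CSR, which replaces the per-entry row index with compact row pointers so a kernel can walk an entire row contiguously. A hedged conversion sketch, assuming the COO entries are already sorted by row:

```cpp
// Hypothetical COO -> CSR conversion; assumes entries are sorted by row.
// rowPtr must have numRows + 1 slots; colIdx/val mirror the COO payload.
void cooToCsr(const int *cooRow, const int *cooCol, const float *cooVal,
              int nnz, int numRows,
              int *rowPtr, int *colIdx, float *val)
{
    for (int r = 0; r <= numRows; r++)
        rowPtr[r] = 0;
    for (int i = 0; i < nnz; i++)
        rowPtr[cooRow[i] + 1]++;          // count entries per row
    for (int r = 0; r < numRows; r++)
        rowPtr[r + 1] += rowPtr[r];       // prefix sum -> row start offsets
    for (int i = 0; i < nnz; i++) {       // payload carries over unchanged
        colIdx[i] = cooCol[i];
        val[i]    = cooVal[i];
    }
}
```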
- convLayerCPU() does the computation in C++ and stores the output in outCPU
- checker() will check whether the values stored in outCPU and outGPU are the same
- Store your result in outGPU in dense format
- You must pass the check to ensure your result is correct!
- Use nvvp (or nvprof) to measure the kernel execution time and data transfer time
- The TA will use TotalExecTime to evaluate your performance
DataTransTime = DataHostToDeviceTime + DataDeviceToHostTime
TotalExecTime = GPUKernelsExecTime + DataTransTime
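Besides nvvp/nvprof, these components can be measured directly in code with CUDA events. A minimal sketch, where `myKernel` and all sizes are placeholders:

```cpp
// Hypothetical timing harness using CUDA events; myKernel is a stand-in.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void myKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;     // placeholder for the real work
}

static float elapsedMs(cudaEvent_t s, cudaEvent_t e)
{
    float ms;
    cudaEventSynchronize(e);
    cudaEventElapsedTime(&ms, s, e);
    return ms;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *hIn = (float *)malloc(bytes), *hOut = (float *)malloc(bytes);
    float *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);

    cudaEvent_t s, e;
    cudaEventCreate(&s);
    cudaEventCreate(&e);

    cudaEventRecord(s);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(e);
    float h2d = elapsedMs(s, e);          // DataHostToDeviceTime

    cudaEventRecord(s);
    myKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaEventRecord(e);
    float kern = elapsedMs(s, e);         // GPUKernelsExecTime

    cudaEventRecord(s);
    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(e);
    float d2h = elapsedMs(s, e);          // DataDeviceToHostTime

    printf("TotalExecTime = %.3f ms\n", kern + h2d + d2h);
    return 0;
}
```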
- It’s team work: 1 ~ 3 people per team
- Compress your code and report into one zip file and upload it to the E3 system
- Name your package as: LeaderID_FP2.zip
- Each team only needs to upload one package to the E3 system
- Please name your report as: LeaderID_Report_FP2.pdf
- Make sure the TA can compile and run your code on the provided server
- Using any CUDA library is forbidden in this project
- Late submissions are NOT acceptable
- Any plagiarism will result in zero points
- LeNet: Gradient Based Learning Applied to Document Recognition
- AlexNet: ImageNet Classification with Deep Convolutional Neural Networks
- CNN: Stanford CS231n Convolutional Neural Networks for Visual Recognition
- CUDA Tutorial: CUDA C/C++ Basics
- CNN with CUDA: Optimizing Convolution Operations in CUDA with Adaptive Tiling
- GPU Profiling: GPU Performance Analysis and Optimisation
- GPU Profiling: CUDA Profiling Documentation
- Network pruning: Learning both Weights and Connections for Efficient Neural Networks
- Sparsity in Neurons: Cnvlutin: Ineffectual-neuron-free Deep Neural Network Computing
- Sparse data GPU: Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors
- Sparse data with CUDA: Efficient Sparse Matrix-Vector Multiplication on CUDA
TA: Chien-Yu Lin
Email: myislin@gmail.com