This was a challenge project at NCTU (National Chiao Tung University): use the CUDA parallel computation framework to speed up the computation of one ConvNet layer. Whichever team achieved the maximum GPU-over-CPU speedup won. This code took first place in the first round, fourth place in the second round, and first place overall.
Each team was provided with a server with an NVIDIA GTX 680 GPU on board. The same one. Yes, every team was provided with the same server and the same GPU. Simultaneously. Feel the pain.
Methods used to achieve maximum speedup included sparse arrays, shared GPU memory, and loop unrolling. Loop unrolling alone gave about a 0.5 ms boost, which secured first place in the first round. Another trick was to switch the compiler architecture from the default (compute_10) to a better one (compute_30): global memory accesses are cached on compute_30 but not on compute_10, which yields a noticeable speedup.
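For illustration, the kernel-level tricks above look roughly like the sketch below. All names (`convRow`, `filt`, `inNeu`, `FILT_SIZE`) are hypothetical, not the names used in the actual submission; the architecture switch corresponds to compiling with something like `nvcc -arch=sm_30` instead of the old default.

```cpp
// Hypothetical fragment illustrating shared memory + loop unrolling;
// built with something like: nvcc -arch=sm_30 conv.cu
#define FILT_SIZE 3

__global__ void convRow(const float *filt, const float *inNeu,
                        float *out, int width)
{
    __shared__ float sFilt[FILT_SIZE];   // filter cached in fast shared memory

    if (threadIdx.x < FILT_SIZE)         // one cooperative load per block
        sFilt[threadIdx.x] = filt[threadIdx.x];
    __syncthreads();

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x + FILT_SIZE <= width) {
        float sum = 0.0f;
        #pragma unroll                   // fixed trip count -> fully unrolled
        for (int k = 0; k < FILT_SIZE; k++)
            sum += sFilt[k] * inNeu[x + k];
        out[x] = sum;
    }
}
```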
The full report is available inside this repository as well.
Original Task
Three sub-directories
./data
./innerProduct
./device
Usage of the base program
Task
Evaluation
Rules
Useful references
Part-I: Use CUDA to accelerate the operations of a typical convolutional layer in commonly used large-scale neural networks. (You can find the description slides here)
Part-II: Accelerate a sparse convolutional layer with CUDA. (You can find the description slides here)
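For reference, the core of the layer in Part-I is a deeply nested dot product. Below is a minimal CPU sketch of a dense, stride-1, valid-padding convolution; all names and the memory layout are assumptions for illustration, not the assignment's actual interface.

```cpp
// Hypothetical dense convolution reference, valid padding, stride 1.
// outNeu[f][y][x] = sum over c,i,j of filt[f][c][i][j] * inNeu[c][y+i][x+j]
void convLayer(const float *filt, const float *inNeu, float *outNeu,
               int F, int C, int H, int W, int K)
{
    int outH = H - K + 1, outW = W - K + 1;
    for (int f = 0; f < F; f++)                     // output feature maps
        for (int y = 0; y < outH; y++)
            for (int x = 0; x < outW; x++) {
                float sum = 0.0f;
                for (int c = 0; c < C; c++)         // input channels
                    for (int i = 0; i < K; i++)     // filter window
                        for (int j = 0; j < K; j++)
                            sum += filt[((f*C + c)*K + i)*K + j]
                                 * inNeu[(c*H + y + i)*W + (x + j)];
                outNeu[(f*outH + y)*outW + x] = sum;
            }
}
```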
This directory contains the input data for the base program
- /data/filt.txt - Stores the values of the filters
- /data/filt.coo - Stores the values of the filters in COO format
- /data/inNeu.txt - Stores the values of the input neurons
- /data/inNeu.coo - Stores the values of the input neurons in COO format
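The exact file layout is defined by the base program. As a hedged illustration, assuming each line of a `.coo` file holds a whitespace-separated `row col value` triple, a reader could look like this:

```cpp
// Hypothetical COO reader: assumes one whitespace-separated
// "row col value" triple per line (actual layout is set by the base program).
#include <cstdio>
#include <cstdlib>

struct CooEntry { int row, col; float val; };

int readCoo(const char *path, CooEntry **out, int *n)
{
    FILE *fp = fopen(path, "r");
    if (!fp) return -1;

    int cap = 1024, cnt = 0;
    CooEntry *e = (CooEntry *)malloc(cap * sizeof(CooEntry));
    while (fscanf(fp, "%d %d %f", &e[cnt].row, &e[cnt].col, &e[cnt].val) == 3) {
        if (++cnt == cap)                 // grow the buffer as entries arrive
            e = (CooEntry *)realloc(e, (cap *= 2) * sizeof(CooEntry));
    }
    fclose(fp);
    *out = e;
    *n = cnt;
    return 0;
}
```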
This example shows how to use CUDA to accelerate an inner product:
cd ./innerProduct
make
make run
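The pattern that example demonstrates is a block-level parallel reduction. A self-contained sketch of that pattern (not the course's actual code; assumes 256 threads per block, a power of two):

```cpp
// Hypothetical inner-product kernel: each block reduces its partial sums
// in shared memory; per-block results are combined afterwards on the host.
__global__ void dotKernel(const float *a, const float *b,
                          float *partial, int n)
{
    __shared__ float cache[256];          // one slot per thread in the block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float sum = 0.0f;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x)
        sum += a[i] * b[i];               // grid-stride accumulation

    cache[threadIdx.x] = sum;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];   // one partial result per block
}
```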
The program under this directory shows the device information:
cd ./device
make
make run
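In spirit, such a device-query program boils down to the CUDA runtime calls below (a minimal sketch, not the actual ./device source):

```cpp
// Minimal device query, similar in spirit to the ./device example.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute capability %d.%d, %zu MB global memory\n",
               d, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem >> 20);
    }
    return 0;
}
```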
git clone https://github.com/OwlSoul/ConvLayer_CUDA.git
make
make run
- Put the input data in sparse format and reimplement your CUDA kernels
- Use NVIDIA Visual Profiler to analyze and improve your code
- Optimize your CUDA kernels for the sparse format
- Improve the input data format (e.g., use another sparse format rather than COO; see the CSR sketch after this list)
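One common step beyond COO is CSR, which replaces the per-entry row index with compact row pointers so a kernel can walk an entire row contiguously. A hedged conversion sketch, assuming the COO entries are already sorted by row:

```cpp
// Hypothetical COO -> CSR conversion; assumes entries are sorted by row.
// rowPtr must have numRows + 1 slots; colIdx/val mirror the COO payload.
void cooToCsr(const int *cooRow, const int *cooCol, const float *cooVal,
              int nnz, int numRows,
              int *rowPtr, int *colIdx, float *val)
{
    for (int r = 0; r <= numRows; r++)
        rowPtr[r] = 0;
    for (int i = 0; i < nnz; i++)
        rowPtr[cooRow[i] + 1]++;          // count entries per row
    for (int r = 0; r < numRows; r++)
        rowPtr[r + 1] += rowPtr[r];       // prefix sum -> row start offsets
    for (int i = 0; i < nnz; i++) {       // payload carries over unchanged
        colIdx[i] = cooCol[i];
        val[i]    = cooVal[i];
    }
}
```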
- convLayerCPU() does the computation in C++ and stores the output in outCPU
- checker() will check whether the values stored in outCPU and outGPU are the same
- Store your result in outGPU in dense format
- You must pass the check to ensure your result is correct!
- Use nvvp (or nvprof) to measure the kernel execution time and data transfer time
- The TA will use TotalExecTime to evaluate your performance
DataTransTime = DataHostToDeviceTime + DataDeviceToHostTime
TotalExecTime = GPUKernelsExecTime + DataTransTime
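Besides nvvp/nvprof, these components can be measured directly in code with CUDA events. A minimal sketch, where `myKernel` and all sizes are placeholders:

```cpp
// Hypothetical timing harness using CUDA events; myKernel is a stand-in.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void myKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;     // placeholder for the real work
}

static float elapsedMs(cudaEvent_t s, cudaEvent_t e)
{
    float ms;
    cudaEventSynchronize(e);
    cudaEventElapsedTime(&ms, s, e);
    return ms;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *hIn = (float *)malloc(bytes), *hOut = (float *)malloc(bytes);
    float *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);

    cudaEvent_t s, e;
    cudaEventCreate(&s);
    cudaEventCreate(&e);

    cudaEventRecord(s);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(e);
    float h2d = elapsedMs(s, e);          // DataHostToDeviceTime

    cudaEventRecord(s);
    myKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaEventRecord(e);
    float kern = elapsedMs(s, e);         // GPUKernelsExecTime

    cudaEventRecord(s);
    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(e);
    float d2h = elapsedMs(s, e);          // DataDeviceToHostTime

    printf("TotalExecTime = %.3f ms\n", kern + h2d + d2h);
    return 0;
}
```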
- It’s team work: 1 ~ 3 people per team
- Compress your code and report into one zip file and upload it to the E3 system
- Name your package as: LeaderID_FP2.zip
- Each team only needs to upload one package to the E3 system
- Please name your report as: LeaderID_Report_FP2.pdf
- Make sure the TA can compile and run your code on the provided server
- Using any CUDA library is forbidden in this project
- Late submissions are NOT acceptable
- Any plagiarism will result in zero points
- LeNet: Gradient Based Learning Applied to Document Recognition
- AlexNet: ImageNet Classification with Deep Convolutional Neural Networks
- CNN: Stanford CS231n Convolutional Neural Networks for Visual Recognition
- CUDA Tutorial: CUDA C/C++ Basics
- CNN with CUDA: Optimizing Convolution Operations in CUDA with Adaptive Tiling
- GPU Profiling: GPU Performance Analysis and Optimisation
- GPU Profiling: CUDA Profiling Documentation
- Network pruning: Learning both Weights and Connections for Efficient Neural Networks
- Sparsity in Neurons: Cnvlutin: Ineffectual-neuron-free Deep Neural Network Computing
- Sparse data GPU: Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors
- Sparse data with CUDA: Efficient Sparse Matrix-Vector Multiplication on CUDA
TA: Chien-Yu Lin
Email: myislin@gmail.com