Skip to content

Implementation and performance evaluation of SparseparallelMatrixVectorMultiplication implementations, considering: several kinds of partitioning schemes and options for CSR and ELL in openMP various assigement of data chunk to warp and thread in CUDA. various compile/launch time modification such as the use of an auxiliary row lens vector for p…

License

Notifications You must be signed in to change notification settings

andreadiiorio/SpMV_openMP_CUDA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SpMV - Sparse parallel Matrix Vector Multiplication

Andrea Di Iorio

Several implementations of Sparse parallel MatrixVector Multiplication in openMP and CUDA #Implementations brief description following an incremental numbering scheme (also used in .tex report and presentation) the implementations are: ##openMP: CSR format implementations -sgemvSerial serial implementations -spmvRowsBasicCSR 1 row per thread -spmvRowsBlocksCSR 1 block of row per thread -spmvTilesCSR 1 2D sub-block of the matrix per thread, inplace partitionated -spmvTilesAllocdCSR 1 2D sub-block of the matrix per thread, separate CSR per col partition ELL format implementations -spmvRowsBasicELL 1 row per thread -spmvRowsBlocksELL 1 row block per thread -spmvTilesELL 1 2D sub-block per thread ##CUDA CSR format implementations -cudaSpMVRowsCSR 1 thread per row -cudaSpMVWarpPerRowCSR 1 warp per row ELL format implementations -cudaSpMVRowsELL 1 thread per row, transposing the matrix and pitching for coalescing -cudaSpMVWarpsPerRowELLNTrasposed 1 warp per row (without trasposing since not necessary)

#Configurations evaluated Beside the different implementations/partitioning schemes for the matrix I've evaluated these additional configuarion, both ##launch time -partitioning grid size (gridRowsx`gridRows) -scheduling configuration via omp env var: static dynamic (with chunk size adapatiation of static's chunk size / FAIR_CHUNKS_FOLDING(4) ##compile time auxiliary rows lens vector (useful for premature ending of (parzial) point product accumulation with ELL's padding) SIMD reduction (not usefull because of sparsity) #Compilation single main files builded separatelly with gcc for openMP only implementations and with nvcc (unfortunatelly c++ frontend) for both openMP and CUDA implementations using CUDACC nvcc exported macro to include in nvcc compilation, C extended statements for CUDA implementations #Testing threshold based confront of numerical result with a serial implementation, validating it with confront with CBLAS numerical result (previous a dense transformation of the problem)

See test/ #Perfomarce TODO include pdfs in doc/ or tables in test/

About

Implementation and performance evaluation of SparseparallelMatrixVectorMultiplication implementations, considering: several kinds of partitioning schemes and options for CSR and ELL in openMP various assigement of data chunk to warp and thread in CUDA. various compile/launch time modification such as the use of an auxiliary row lens vector for p…

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published