This repository contains various implementations of Sparse Matrix-Vector Multiplication (SpMV) for both CPU and GPU, developed as part of the GPU Computing graduate course at the University of Trento.
```
├── bin/          # Compiled executables
├── data/         # Matrix Market test files
├── include/      # Header files for types and utilities
├── lib/          # Library implementation files (kernels, utils)
├── obj/          # Object files
├── results/      # CSV files generated by analysis scripts
├── scripts/      # Benchmark, download, and analysis scripts
├── src/          # Source code for all executables
│   ├── spmv_cpu_*.c   # CPU implementations
│   └── spmv_gpu_*.cu  # GPU implementations
├── deviceQuery/  # NVIDIA device information utility
└── test/         # Experimental code and cuSPARSE implementation
```
The test/ directory contains experimental code and examples developed during lab sessions. It also contains the cuSPARSE implementation, which can be compiled using test/compile.sh.
- Simple CSR: A basic, single-threaded implementation (`spmv_cpu_csr.c`).
- ILP: A version optimized with manual loop unrolling to exploit instruction-level parallelism (`spmv_cpu_csr_ilp.c`).
- Simple: A basic row-per-thread kernel (`spmv_gpu_simple_csr.cu`).
- Value Sequential: A value-per-thread kernel using atomic adds; inefficient but illustrative (`spmv_gpu_value_sequential_csr.cu`).
- Value Blocked: An improved value-parallel kernel with strided access (`spmv_gpu_value_blocked_csr.cu`).
- Vector (Warp-per-Row): A kernel that assigns one warp to process each row (`spmv_gpu_vector_csr.cu`).
- Vector Double Buffer: An optimized vector kernel that processes two rows per warp to improve occupancy (`spmv_gpu_vector_test_csr.cu`).
- Adaptive Row Blocks: A kernel that dynamically assigns rows to either a warp or a full block based on row length (`spmv_gpu_adaptive_csr.cu`).
- Hybrid Adaptive: The most advanced kernel, which classifies rows as "short" or "long" and uses a thread-per-row (scalar) or warp-per-row (vector) strategy accordingly (`spmv_gpu_hybrid_adaptive_csr.cu`).
To compile all implementations using the default release configuration:

```bash
make
```

Other useful targets are available in the Makefile:

```bash
# Build with debug symbols
make debug

# Clean all build artifacts
make clean
```

The Makefile is configured for an NVIDIA A30 GPU (sm_80). If you are compiling for a different architecture (e.g., an L40S), you must update the RELEASE_NV_OPT variable in the Makefile. For an L40S, change `--gpu-architecture=sm_80` to `--gpu-architecture=sm_89`, then run `make clean && make`.
Run the provided script to download and unpack the test matrices into the data/ directory:

```bash
./scripts/download_matrices.sh
```

To submit all benchmark jobs to the SLURM scheduler, use the main script:

```bash
./scripts/run_all_benchmarks.sh
```

You can also run benchmarks for specific implementations using their corresponding scripts (e.g., `sbatch scripts/cpu_simple_run.sh`, `sbatch scripts/run_spmv_hybrid_adaptive.sh`).
The repository includes scripts for running parameter sweeps:
- Hybrid Kernel Sweep: Use `scripts/spmv_test.sh` to test different `(threads, threshold)` combinations for the hybrid adaptive kernel.
After running the benchmarks, use the extraction scripts to generate CSV files:

- Main Benchmarks:

  ```bash
  ./scripts/extract_spmv_data.sh
  ```

  This script finds all `.out` files in the root directory and generates `spmv_results_minimal.csv`.

- Hybrid Kernel Sweep:

  ```bash
  ./scripts/extract_test.sh hybrid_adaptive_sweep-[JOB_ID].out
  ```

  This generates `hybrid_sweep_results.csv`.
The benchmarks measure:
- Execution Time (s): Average time per kernel execution.
- Memory Bandwidth (GB/s): Effective memory throughput.
- Computational Performance (GFLOPS): Giga-Floating-Point Operations Per Second.
- CPU: AMD EPYC 9334 @ 2.7GHz (32 Cores / 64 Threads)
- GPU: NVIDIA A30 (24 GB HBM2)
- CUDA Toolkit: 12.5.0