This repository contains various implementations of Sparse Matrix-Vector Multiplication (SpMV) for both CPU and GPU, developed as part of the GPU Computing graduate course at the University of Trento.
```
├── bin/          # Compiled executables
├── data/         # Matrix Market test files
├── include/      # Header files for types and utilities
├── lib/          # Library implementation files (kernels, utils)
├── obj/          # Object files
├── results/      # CSV files generated by analysis scripts
├── scripts/      # Benchmark, download, and analysis scripts
├── src/          # Source code for all executables
│   ├── spmv_cpu_*.c   # CPU implementations
│   └── spmv_gpu_*.cu  # GPU implementations
├── deviceQuery/  # NVIDIA device information utility
└── test/         # Experimental code and cuSPARSE implementation
```
The test/ directory contains experimental code and examples developed during lab sessions. It also contains the cuSPARSE implementation, which can be compiled using test/compile.sh.
- Simple CSR: A basic, single-threaded implementation (`spmv_cpu_csr.c`).
- ILP: A version optimized with manual loop unrolling to exploit instruction-level parallelism (`spmv_cpu_csr_ilp.c`).
- Simple: A basic row-per-thread kernel (`spmv_gpu_simple_csr.cu`).
- Value Sequential: A value-per-thread kernel using atomic adds; inefficient but illustrative (`spmv_gpu_value_sequential_csr.cu`).
- Value Blocked: An improved value-parallel kernel with strided access (`spmv_gpu_value_blocked_csr.cu`).
- Vector (Warp-per-Row): A kernel that assigns one warp to process each row (`spmv_gpu_vector_csr.cu`).
- Vector Double Buffer: An optimized vector kernel that processes two rows per warp to improve occupancy (`spmv_gpu_vector_test_csr.cu`).
- Adaptive Row Blocks: A kernel that dynamically assigns rows to either a warp or a full block based on row length (`spmv_gpu_adaptive_csr.cu`).
- Hybrid Adaptive: The most advanced kernel, which classifies rows as "short" or "long" and uses a thread-per-row (scalar) or warp-per-row (vector) strategy accordingly (`spmv_gpu_hybrid_adaptive_csr.cu`).
To compile all implementations using the default release configuration:

```bash
make
```

Other useful targets are available in the Makefile:

```bash
# Build with debug symbols
make debug

# Clean all build artifacts
make clean
```

The Makefile is configured for an NVIDIA A30 GPU (sm_80). If you are compiling for a different architecture (e.g., an L40S), you must update the RELEASE_NV_OPT variable in the Makefile. For an L40S, change `--gpu-architecture=sm_80` to `--gpu-architecture=sm_89`, then run `make clean && make`.
Run the provided script to download and unpack the test matrices into the data/ directory:

```bash
./scripts/download_matrices.sh
```

To submit all benchmark jobs to the SLURM scheduler, use the main script:

```bash
./scripts/run_all_benchmarks.sh
```

You can also run benchmarks for specific implementations using their corresponding scripts (e.g., `sbatch scripts/cpu_simple_run.sh`, `sbatch scripts/run_spmv_hybrid_adaptive.sh`).
The repository includes scripts for running parameter sweeps:
- Hybrid Kernel Sweep: Use `scripts/spmv_test.sh` to test different `(threads, threshold)` combinations for the hybrid adaptive kernel.
After running the benchmarks, use the extraction scripts to generate CSV files:

- Main Benchmarks:

  ```bash
  ./scripts/extract_spmv_data.sh
  ```

  This script finds all `.out` files in the root directory and generates `spmv_results_minimal.csv`.

- Hybrid Kernel Sweep:

  ```bash
  ./scripts/extract_test.sh hybrid_adaptive_sweep-[JOB_ID].out
  ```

  This generates `hybrid_sweep_results.csv`.
The benchmarks measure:
- Execution Time (s): Average time per kernel execution.
- Memory Bandwidth (GB/s): Effective memory throughput.
- Computational Performance (GFLOPS): Giga-Floating-Point Operations Per Second.
- CPU: AMD EPYC 9334 @ 2.7GHz (32 Cores / 64 Threads)
- GPU: NVIDIA A30 (24 GB HBM2)
- CUDA Toolkit: 12.5.0