MaiDormo/GPU-Computing-2025-256137

GPU-Computing-2025-256137

This repository contains various implementations of Sparse Matrix-Vector Multiplication (SpMV) for both CPU and GPU, developed as part of the GPU Computing graduate course at the University of Trento.

Repository Structure

├── bin/                    # Compiled executables
├── data/                   # Matrix market test files
├── include/                # Header files for types and utilities
├── lib/                    # Library implementation files (kernels, utils)
├── obj/                    # Object files
├── results/                # CSV files generated by analysis scripts
├── scripts/                # Benchmark, download, and analysis scripts
├── src/                    # Source code for all executables
│   ├── spmv_cpu_*.c        # CPU implementations
│   └── spmv_gpu_*.cu       # GPU implementations
├── deviceQuery/            # NVIDIA device information utility
└── test/                   # Experimental code and cuSPARSE implementation

The test/ directory contains experimental code and examples developed during lab sessions. It also contains the cuSPARSE implementation, which can be compiled using test/compile.sh.

Implementations

CPU Implementations

  • Simple CSR: A basic, single-threaded implementation that processes the rows one after another (spmv_cpu_csr.c).
  • ILP: A version optimized with manual loop unrolling to exploit instruction-level parallelism (spmv_cpu_csr_ilp.c).
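As an illustration of the two CPU variants listed above, here is a minimal sketch of CSR SpMV in C, first as a plain row loop and then with the inner loop unrolled by four. The function names, the unroll factor, the use of double for values, and the array names (row_ptr, col_idx, vals) are assumptions made for this example and may not match spmv_cpu_csr.c or spmv_cpu_csr_ilp.c.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline CSR SpMV: y = A * x, with rows processed one after another.
 * row_ptr has n_rows + 1 entries; col_idx and vals have row_ptr[n_rows]
 * entries. All names here are hypothetical, not taken from the repository. */
void spmv_csr_basic(int n_rows, const int *row_ptr, const int *col_idx,
                    const double *vals, const double *x, double *y)
{
    assert(row_ptr != NULL && col_idx != NULL && vals != NULL);
    assert(x != NULL && y != NULL);
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += vals[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

/* Same product with the inner loop unrolled by 4 (an assumed factor).
 * Four independent accumulators remove the dependency between consecutive
 * multiply-adds, which is what lets the CPU overlap them. */
void spmv_csr_unrolled(int n_rows, const int *row_ptr, const int *col_idx,
                       const double *vals, const double *x, double *y)
{
    assert(row_ptr != NULL && col_idx != NULL && vals != NULL);
    assert(x != NULL && y != NULL);
    for (int i = 0; i < n_rows; ++i) {
        int k = row_ptr[i];
        const int end = row_ptr[i + 1];
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (; k + 3 < end; k += 4) {
            s0 += vals[k]     * x[col_idx[k]];
            s1 += vals[k + 1] * x[col_idx[k + 1]];
            s2 += vals[k + 2] * x[col_idx[k + 2]];
            s3 += vals[k + 3] * x[col_idx[k + 3]];
        }
        for (; k < end; ++k)            /* leftover 0 to 3 entries */
            s0 += vals[k] * x[col_idx[k]];
        y[i] = (s0 + s1) + (s2 + s3);
    }
}
```

Because the unrolled version adds the products in a different order, its results can differ from the baseline in the last bits for general floating-point inputs.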

GPU Implementations

  • Simple: A basic row-per-thread kernel (spmv_gpu_simple_csr.cu).
  • Value Sequential: A kernel that assigns one nonzero value to each thread and accumulates the row sums with atomic adds; inefficient but illustrative (spmv_gpu_value_sequential_csr.cu).
  • Value Blocked: An improved value-parallel kernel with strided access (spmv_gpu_value_blocked_csr.cu).
  • Vector (Warp-per-Row): A kernel that assigns one warp to process each row (spmv_gpu_vector_csr.cu).
  • Vector Double Buffer: An optimized vector kernel that processes two rows per warp to improve occupancy (spmv_gpu_vector_test_csr.cu).
  • Adaptive Row Blocks: A kernel that dynamically assigns rows to either a warp or a full block based on row length (spmv_gpu_adaptive_csr.cu).
  • Hybrid Adaptive: The most advanced kernel, which classifies rows as "short" or "long" and uses a thread-per-row (scalar) or warp-per-row (vector) strategy accordingly (spmv_gpu_hybrid_adaptive_csr.cu).
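The short/long row classification behind the Hybrid Adaptive kernel can be sketched on the host in plain C. This is only an illustration of the idea: classify_rows is a hypothetical helper, and whether the repository classifies rows on the host or inside the kernel, and whether it compares with > or >=, is not stated here and is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: splits row indices into a "short" and a "long" list.
 * A row counts as long when its number of nonzeros exceeds `threshold`
 * (assumed comparison). Short rows would go to a thread-per-row (scalar)
 * strategy, long rows to a warp-per-row (vector) strategy.
 * short_rows and long_rows must each have room for n_rows entries. */
void classify_rows(int n_rows, const int *row_ptr, int threshold,
                   int *short_rows, int *n_short,
                   int *long_rows, int *n_long)
{
    assert(row_ptr != NULL && short_rows != NULL && long_rows != NULL);
    assert(n_short != NULL && n_long != NULL);
    *n_short = 0;
    *n_long = 0;
    for (int i = 0; i < n_rows; ++i) {
        const int len = row_ptr[i + 1] - row_ptr[i];  /* nonzeros in row i */
        if (len > threshold)
            long_rows[(*n_long)++] = i;
        else
            short_rows[(*n_short)++] = i;
    }
}
```

With a threshold of 32 (the warp size, chosen here only as an example), rows of length 2, 38, 1, 0, and 59 would be split into short rows {0, 2, 3} and long rows {1, 4}.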

How to Compile

To compile all implementations using the default release configuration:

make

Other useful targets are available in the Makefile:

# Build with debug symbols
make debug

# Clean all build artifacts
make clean

Note on GPU Architecture

The Makefile is configured for an NVIDIA A30 GPU (sm_80). If you are compiling for a different architecture (e.g., an L40S), you must update the RELEASE_NV_OPT variable in the Makefile. For an L40S, change --gpu-architecture=sm_80 to --gpu-architecture=sm_89, then run make clean && make.

How to Download the Sparse Matrices

Run the provided script to download and unpack the test matrices into the data/ directory:

./scripts/download_matrices.sh

How to Run Benchmarks

Running All Benchmarks

To submit all benchmark jobs to the SLURM scheduler, use the main script:

./scripts/run_all_benchmarks.sh

Running Individual Implementations

You can run benchmarks for specific implementations using their corresponding scripts (e.g., sbatch scripts/cpu_simple_run.sh, sbatch scripts/run_spmv_hybrid_adaptive.sh).

Running Experiments

The repository includes scripts for running parameter sweeps:

  • Hybrid Kernel Sweep: Use scripts/spmv_test.sh to test different (threads, threshold) combinations for the hybrid adaptive kernel.

Data Analysis

After running the benchmarks, use the extraction scripts to generate CSV files:

  • Main Benchmarks:

    ./scripts/extract_spmv_data.sh

    This script finds all .out files in the root directory and generates spmv_results_minimal.csv.

  • Hybrid Kernel Sweep:

    ./scripts/extract_test.sh hybrid_adaptive_sweep-[JOB_ID].out

    This generates hybrid_sweep_results.csv.

Performance Metrics

The benchmarks measure:

  • Execution Time (s): Average time per kernel execution.
  • Memory Bandwidth (GB/s): Effective memory throughput.
  • Computational Performance (GFLOPS): Billions of floating-point operations per second.
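For reference, the sketch below shows one common way to derive the last two metrics from the average kernel time. It is an assumption, not the repository's formula: it counts two floating-point operations (one multiply, one add) per nonzero, and it counts memory traffic as one read of each CSR array, one read of each element of x, and one write of each element of y, with double values and int indices. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* GFLOPS under the assumption of 2 flops (multiply + add) per nonzero. */
double spmv_gflops(long long nnz, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * (double)nnz / seconds / 1e9;
}

/* Effective bandwidth in GB/s under an assumed traffic model:
 *   vals and col_idx : nnz entries each, read once
 *   row_ptr          : n_rows + 1 entries, read once
 *   x                : n_cols entries, read once (ideal caching)
 *   y                : n_rows entries, written once */
double spmv_bandwidth_gbs(long long n_rows, long long n_cols,
                          long long nnz, double seconds)
{
    assert(seconds > 0.0);
    const double bytes =
        (double)nnz * (double)(sizeof(double) + sizeof(int)) +
        (double)(n_rows + 1) * (double)sizeof(int) +
        (double)n_cols * (double)sizeof(double) +
        (double)n_rows * (double)sizeof(double);
    return bytes / seconds / 1e9;
}
```

Counting x as read once is the optimistic end of the range; a model that charges one read of x per nonzero would report a higher effective bandwidth for the same run.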

Hardware and Software Used

  • CPU: AMD EPYC 9334 @ 2.7 GHz (32 cores / 64 threads)
  • GPU: NVIDIA A30 (24 GB HBM2)
  • CUDA Toolkit: 12.5.0

About

GPU Computing graduate course at the University of Trento, academic year 2024/2025
