CUDA C++ practice project for RTX 4070 SUPER: explore GPU concurrency, pinned memory, and Nsight profiling. Includes SAXPY and 2D blur kernels for practicing optimization, stream overlap, and timing analysis, skills relevant to NVIDIA Developer Technology engineering.

FlosMume/cpp-cuda-deepvision-rtx-starter

DeepVision-RTX (Starter)

A CUDA C++ practice project designed for the RTX 4070 SUPER (Ada Lovelace, compute capability 8.9). It demonstrates how to overlap data transfers with computation using streams and pinned memory, apply basic kernel optimizations with 1D and 2D grid configurations, and time code sections with CUDA events alongside profiling in Nsight Systems and Nsight Compute.

What’s here?

  • Pinned host memory + cudaMemcpyAsync to demonstrate overlap
  • Multiple streams for concurrent copy/compute
  • Timed sections with cudaEventRecord / cudaEventElapsedTime
  • Kernels: saxpy (1D), blur3x3_naive (2D)
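The items above can be combined into one short sketch. It is not the repository's code: the helper name `saxpy_streams`, the `CUDA_CHECK` macro, the `saxpy` signature, the chunking scheme, and the 256-thread block size are all assumptions made for illustration. The input is split into one chunk per stream, each stream issues copy-in, kernel, and copy-out, and a pair of events brackets the whole sequence.

```cuda
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with file/line context if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                          \
  do {                                                            \
    cudaError_t err_ = (call);                                    \
    if (err_ != cudaSuccess) {                                    \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                   cudaGetErrorString(err_));                     \
      std::exit(EXIT_FAILURE);                                    \
    }                                                             \
  } while (0)

// y[i] = a * x[i] + y[i]; one thread per element on a 1D grid.
// Signature is an assumption; the repository's kernel may differ.
__global__ void saxpy(int n, float a, const float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical helper (not taken from the repository).
// h_x and h_y should be pinned (cudaMallocHost); with pageable memory the
// copies still work but cannot overlap with kernels.
// Returns the elapsed GPU time in milliseconds.
float saxpy_streams(int n, float a, const float* h_x, float* h_y,
                    int num_streams) {
  assert(n > 0 && num_streams > 0);
  float *d_x = nullptr, *d_y = nullptr;
  CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
  CUDA_CHECK(cudaMalloc(&d_y, n * sizeof(float)));

  std::vector<cudaStream_t> streams(num_streams);
  for (auto& s : streams) CUDA_CHECK(cudaStreamCreate(&s));
  cudaEvent_t start, stop;
  CUDA_CHECK(cudaEventCreate(&start));
  CUDA_CHECK(cudaEventCreate(&stop));

  const int chunk = (n + num_streams - 1) / num_streams;
  // Events on the legacy default stream order against all blocking streams,
  // so start/stop bracket the work of every stream created above.
  CUDA_CHECK(cudaEventRecord(start, cudaStreamLegacy));
  for (int s = 0; s < num_streams; ++s) {
    const int off = s * chunk;
    const int len = std::min(chunk, n - off);
    if (len <= 0) break;
    const size_t bytes = len * sizeof(float);
    // Copy in, compute, copy out: ordered within a stream,
    // free to overlap across streams.
    CUDA_CHECK(cudaMemcpyAsync(d_x + off, h_x + off, bytes,
                               cudaMemcpyHostToDevice, streams[s]));
    CUDA_CHECK(cudaMemcpyAsync(d_y + off, h_y + off, bytes,
                               cudaMemcpyHostToDevice, streams[s]));
    saxpy<<<(len + 255) / 256, 256, 0, streams[s]>>>(len, a, d_x + off,
                                                     d_y + off);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpyAsync(h_y + off, d_y + off, bytes,
                               cudaMemcpyDeviceToHost, streams[s]));
  }
  CUDA_CHECK(cudaEventRecord(stop, cudaStreamLegacy));
  CUDA_CHECK(cudaEventSynchronize(stop));

  float ms = 0.0f;
  CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
  for (auto& s : streams) {
    CUDA_CHECK(cudaStreamSynchronize(s));
    CUDA_CHECK(cudaStreamDestroy(s));
  }
  CUDA_CHECK(cudaEventDestroy(start));
  CUDA_CHECK(cudaEventDestroy(stop));
  CUDA_CHECK(cudaFree(d_x));
  CUDA_CHECK(cudaFree(d_y));
  return ms;
}
```

To use it, allocate `h_x` and `h_y` with `cudaMallocHost`, fill them, and call for example `saxpy_streams(n, 3.0f, h_x, h_y, 4)`. Comparing the returned time for 1 stream against 4 streams, and looking at the same run in an Nsight Systems timeline, is one way to see whether copies and kernels actually overlap.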

Build

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
./build/deepvision_rtx

Profile (examples)

# Nsight Systems CLI (on host with CUDA Toolkit installed); open the resulting nsys_report file in the Nsight Systems GUI
nsys profile -o nsys_report ./build/deepvision_rtx

# Nsight Compute: full metric set for every kernel launch; add --kernel-name saxpy to collect a single kernel
ncu --set full --target-processes all ./build/deepvision_rtx

Next steps

  • Convert blur kernel to shared-memory tiled version
  • Add half-precision path to prep for Tensor Cores
  • Compare end-to-end with cuDNN and optionally TensorRT
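As a starting point for the first item, here is a minimal sketch of a shared-memory tiled 3x3 box blur next to a naive reference. Everything in it is an assumption rather than the repository's code: the kernel signatures, the single-channel `float` image layout, the clamp-to-edge border policy, the 16x16 tile size, and the helper `max_diff_naive_vs_tiled`.

```cuda
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // block edge length; an assumed value, tune per GPU

// Exit on any CUDA runtime error.
static void check(cudaError_t e) {
  if (e != cudaSuccess) {
    std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e));
    std::exit(EXIT_FAILURE);
  }
}

// Clamp v into [lo, hi]; edge pixels reuse the nearest valid pixel (assumed policy).
__host__ __device__ inline int clampi(int v, int lo, int hi) {
  return v < lo ? lo : (v > hi ? hi : v);
}

// Reference: each thread reads its 3x3 neighborhood straight from global memory.
__global__ void blur3x3_naive(const float* in, float* out, int w, int h) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= w || y >= h) return;
  float sum = 0.0f;
  for (int dy = -1; dy <= 1; ++dy)
    for (int dx = -1; dx <= 1; ++dx)
      sum += in[clampi(y + dy, 0, h - 1) * w + clampi(x + dx, 0, w - 1)];
  out[y * w + x] = sum / 9.0f;
}

// Tiled: the block stages a (TILE+2)x(TILE+2) patch (tile plus 1-pixel halo)
// in shared memory, then each thread reads its 9 inputs from there.
// Must be launched with blockDim = (TILE, TILE).
__global__ void blur3x3_tiled(const float* in, float* out, int w, int h) {
  __shared__ float tile[TILE + 2][TILE + 2];
  const int x0 = blockIdx.x * TILE - 1;  // image coords of tile[0][0]
  const int y0 = blockIdx.y * TILE - 1;
  // Cooperative load: threads stride over the patch so halo cells are filled too.
  for (int ty = threadIdx.y; ty < TILE + 2; ty += TILE)
    for (int tx = threadIdx.x; tx < TILE + 2; tx += TILE)
      tile[ty][tx] =
          in[clampi(y0 + ty, 0, h - 1) * w + clampi(x0 + tx, 0, w - 1)];
  __syncthreads();  // all loads must land before any thread reads the tile

  const int x = blockIdx.x * TILE + threadIdx.x;
  const int y = blockIdx.y * TILE + threadIdx.y;
  if (x >= w || y >= h) return;
  float sum = 0.0f;
  for (int dy = 0; dy <= 2; ++dy)
    for (int dx = 0; dx <= 2; ++dx)
      sum += tile[threadIdx.y + dy][threadIdx.x + dx];
  out[y * w + x] = sum / 9.0f;
}

// Hypothetical host helper: runs both kernels on a synthetic w x h image and
// returns the largest absolute difference between their outputs.
float max_diff_naive_vs_tiled(int w, int h) {
  assert(w > 0 && h > 0);
  std::vector<float> img(w * h), a(w * h), b(w * h);
  for (int i = 0; i < w * h; ++i) img[i] = float((i * 37) % 101);
  const size_t bytes = img.size() * sizeof(float);
  float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
  check(cudaMalloc(&d_in, bytes));
  check(cudaMalloc(&d_a, bytes));
  check(cudaMalloc(&d_b, bytes));
  check(cudaMemcpy(d_in, img.data(), bytes, cudaMemcpyHostToDevice));
  dim3 block(TILE, TILE), grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
  blur3x3_naive<<<grid, block>>>(d_in, d_a, w, h);
  blur3x3_tiled<<<grid, block>>>(d_in, d_b, w, h);
  check(cudaGetLastError());
  check(cudaMemcpy(a.data(), d_a, bytes, cudaMemcpyDeviceToHost));
  check(cudaMemcpy(b.data(), d_b, bytes, cudaMemcpyDeviceToHost));
  check(cudaFree(d_in));
  check(cudaFree(d_a));
  check(cudaFree(d_b));
  float m = 0.0f;
  for (int i = 0; i < w * h; ++i) m = std::max(m, std::fabs(a[i] - b[i]));
  return m;
}
```

Checking the tiled kernel against the naive one on an image whose size is not a multiple of `TILE`, for example `max_diff_naive_vs_tiled(67, 45)`, exercises the halo and partial-block paths before any timing comparison is made. For a 3x3 filter the gain from tiling may be modest because the L1/L2 caches already absorb much of the reuse, so it is worth measuring both versions rather than assuming a speedup.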
