DeepVision-RTX Starter

This is an enhanced README combining:

GitHub-friendly formatting
Technical deep dive
ASCII diagrams
Rendered PNG diagrams
Profiling workflow
Architecture explanations
GPU hardware reasoning

Architecture Diagram (SVG)

Streams Overlap Diagram (SVG)

ASCII Architecture Diagram

   +-------------------+      PCIe / DMA      +---------------------+
   |      CPU Host     |  ----------------->  |     GPU Global      |
   |  (Pinned Memory)  |  <-----------------  |      Memory         |
   +-------------------+                      +---------------------+
            |                                           |
            | Launch Kernels                            |
            v                                           v
      +--------------+                          +------------------+
      | CUDA Driver  |                          |  SMs (Ada 8.9)   |
      | Runtime API  |                          |  Warps / Threads |
      +--------------+                          +------------------+

ASCII Streams Overlap Diagram

Time →
----------------------------------------------------------------------
H2D Stream 0: [========== copy =========]
H2D Stream 1:         [========== copy =========]
Compute Stream 2:             [==== kernel ====]
D2H Stream 0:                           [==== copy ====]
----------------------------------------------------------------------

Build Instructions

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
./build/deepvision_rtx

Profiling Instructions (Nsight Systems & Nsight Compute)

nsys profile -o nsys_report ./build/deepvision_rtx
ncu --set full ./build/deepvision_rtx

Project Structure

src/
   main.cpp            # Entry point
   main.cu             # Separate experimental demo
   conv_kernels.cu     # SAXPY + blur3x3 kernels
   conv_kernels.cuh    # Kernel declarations
   utils/
       check_cuda.hpp  # Error-checking utilities

GPU Architecture Notes (RTX 4070 SUPER - Ada 8.9)

SM count: 46
Warp size: 32
Max threads/block: 1024
Memory bandwidth: 504 GB/s
Concurrent copy/compute supported
Best performance achieved when:
- You use pinned memory
- H2D and D2H overlap with compute
- Kernels maintain good occupancy

Roadmap

Add shared-memory tiled blur
Add constant-memory kernel variants
Add half-precision path (FP16)
Add Tensor Core WMMA version
Add occupancy analysis + roofline plot
Add Nsight Compute performance tables
Add multi-kernel pipelines
Compare against cuDNN for 3×3 conv

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
docs		docs
scripts		scripts
src		src
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

DeepVision-RTX Starter

Architecture Diagram (SVG)

Streams Overlap Diagram (SVG)

ASCII Architecture Diagram

ASCII Streams Overlap Diagram

Build Instructions

Profiling Instructions (Nsight Systems & Nsight Compute)

Project Structure

GPU Architecture Notes (RTX 4070 SUPER - Ada 8.9)

Roadmap

About

Uh oh!

Releases

Packages

Languages

FlosMume/cpp-cuda-deepvision-rtx-starter

Folders and files

Latest commit

History

Repository files navigation

DeepVision-RTX Starter

Architecture Diagram (SVG)

Streams Overlap Diagram (SVG)

ASCII Architecture Diagram

ASCII Streams Overlap Diagram

Build Instructions

Profiling Instructions (Nsight Systems & Nsight Compute)

Project Structure

GPU Architecture Notes (RTX 4070 SUPER - Ada 8.9)

Roadmap

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages