CUDA-Insight-AI is a command-line tool that analyzes CUDA kernels by combining static analysis, optional runtime profiling, and an LLM agent with tool-calling. The goal is to help developers detect potential performance issues in their CUDA kernels and receive optimization suggestions.
- Helps developers understand GPU performance bottlenecks without deep CUDA expertise
- Provides AI-driven optimization guidance that combines static analysis and runtime metrics
- Bridges traditional developer tools and modern LLM agentic systems for code analysis
- Useful for GPU/AI engineering education and performance optimization workflows
Technologies used:
- Python (CLI, orchestration)
- CUDA C/C++ (kernels)
- C++17 (profiler)
- OpenAI API / LLM function calling
- JSON-based tool-calling
- CMake (C++ build)
Project structure:

    CUDA-Insight-AI/
    ├── src/
    │   ├── ai/
    │   │   ├── llm_agent.py              # LLM agent with tool-calling
    │   │   └── tool_calling_schema.json  # Tool schema for the agent
    │   ├── analysis/
    │   │   └── static_analyzer.py        # Static analyzer for CUDA kernels
    │   ├── cli/
    │   │   └── main.py                   # Command-line interface
    │   ├── cuda/
    │   │   ├── example_kernels/          # Example CUDA kernels
    │   │   │   ├── saxpy.cu
    │   │   │   ├── vector_add.cu
    │   │   │   └── divergent_kernel.cu
    │   │   └── runner.py                 # CUDA kernel runner
    │   └── profiling/
    │       ├── profiler.cpp              # C++ profiler for runtime metrics
    │       ├── profiler_wrapper.py       # Python wrapper for the profiler
    │       └── CMakeLists.txt            # Build configuration
    ├── tests/                            # Unit tests
    ├── examples/                         # Usage examples
    ├── report/                           # LaTeX report
    └── requirements.txt                  # Python dependencies

The analysis pipeline follows three main steps:

    CUDA (.cu file)
          │
          ▼
    Static Analyzer ────► JSON (analysis)
          │
          ▼
    Profiler (opt) ─────► JSON (metrics)
          │
          ▼
    LLM Agent ──────────► Final Report

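The data flow in the diagram can be sketched as a small function that chains the three stages and skips the profiler when it is not requested. This is only an illustration: `run_pipeline` and the three stand-in stage functions are hypothetical names, not the project's actual API.

```python
import json
from typing import Callable, Optional


def run_pipeline(
    source: str,
    analyze: Callable[[str], dict],
    summarize: Callable[[dict, Optional[dict]], str],
    profile: Optional[Callable[[str], dict]] = None,
) -> str:
    """Chain the stages; the profiling stage runs only if a profiler is supplied."""
    analysis = analyze(source)                                   # step 1: static analysis
    metrics = profile(source) if profile is not None else None   # step 2: optional
    return summarize(analysis, metrics)                          # step 3: report


# Stand-in stages (hypothetical), used only to show the flow of JSON-like dicts.
def fake_analyze(source: str) -> dict:
    return {"kernels": source.count("__global__")}


def fake_summarize(analysis: dict, metrics: Optional[dict]) -> str:
    return json.dumps({"analysis": analysis, "metrics": metrics}, sort_keys=True)


report = run_pipeline("__global__ void k() {}", fake_analyze, fake_summarize)
print(report)
```

Passing the stages as callables keeps the profiler genuinely optional: the report stage receives `None` for the metrics and can say so explicitly.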
The static analyzer (src/analysis/static_analyzer.py) inspects the CUDA source file without executing it. It detects:
- Kernel definitions (__global__ functions)
- Thread indexing patterns (threadIdx, blockIdx, blockDim)
- Simple patterns that may cause warp divergence
- Memory access patterns (e.g., a[i], a[i + stride])
The analyzer returns a JSON dictionary containing the extracted information, usable by the LLM agent.
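To make the kind of output concrete, here is a minimal regex-based sketch of such an analysis. It is an assumption about how these checks could be approximated, not the code in static_analyzer.py; the function name `analyze_source` and the dictionary keys are hypothetical.

```python
import json
import re

# Matches "__global__ void name(" and captures the kernel name.
KERNEL_RE = re.compile(r"__global__\s+void\s+(\w+)\s*\(")
INDEX_VARS = ("threadIdx", "blockIdx", "blockDim")
# A branch whose condition mentions threadIdx directly is a simple divergence hint.
DIVERGENT_IF_RE = re.compile(r"if\s*\([^)]*threadIdx[^)]*\)")
# Matches array accesses such as a[i] or a[i + stride].
ACCESS_RE = re.compile(r"\b(\w+)\s*\[([^\]]+)\]")


def analyze_source(source: str) -> dict:
    """Return a JSON-serializable summary of simple patterns found in the source."""
    accesses = {f"{name}[{index.strip()}]" for name, index in ACCESS_RE.findall(source)}
    return {
        "kernels": KERNEL_RE.findall(source),
        "index_vars": [v for v in INDEX_VARS if v in source],
        "possible_divergence": bool(DIVERGENT_IF_RE.search(source)),
        "memory_accesses": sorted(accesses),
    }


SAXPY_SOURCE = """
__global__ void saxpy(float a, float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}
"""

result = analyze_source(SAXPY_SOURCE)
print(json.dumps(result, indent=2))
```

Regular expressions only catch common textual patterns; a real analyzer would need a proper parser to handle macros, templates, and nested expressions.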
The profiler (src/profiling/profiler_wrapper.py) measures runtime performance of the kernel when a compatible NVIDIA GPU and CUDA environment are available. It provides:
- Kernel execution time
- Other performance metrics if available
If no GPU is available, the profiler can operate in mock mode to allow testing of the rest of the pipeline.
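The fallback to mock mode could look roughly like the sketch below. The names `profile_kernel`, `mock_metrics`, and `gpu_available` are hypothetical, and using the presence of `nvidia-smi` on the PATH as a GPU check is an assumed heuristic, not necessarily what profiler_wrapper.py does.

```python
import shutil
from typing import Callable, Optional


def gpu_available() -> bool:
    # Assumed heuristic: nvidia-smi on the PATH suggests an NVIDIA driver is installed.
    return shutil.which("nvidia-smi") is not None


def mock_metrics(kernel_path: str) -> dict:
    # No timing is invented here: the value stays None and the result is flagged as mock.
    return {"kernel": kernel_path, "execution_time_ms": None, "mock": True}


def profile_kernel(
    kernel_path: str,
    run_real: Optional[Callable[[str], dict]] = None,
    force_mock: bool = False,
) -> dict:
    """Use the real profiler when possible, otherwise return mock metrics."""
    if force_mock or run_real is None or not gpu_available():
        return mock_metrics(kernel_path)
    metrics = dict(run_real(kernel_path))  # copy so the caller's dict is not mutated
    metrics["mock"] = False
    return metrics


print(profile_kernel("src/cuda/example_kernels/saxpy.cu", force_mock=True))
```

Flagging mock results explicitly lets later stages, including the LLM agent, avoid drawing performance conclusions from placeholder values.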
The LLM agent (src/ai/llm_agent.py) is responsible for:
- Calling tools (static analyzer and profiler)
- Interpreting JSON results
- Generating a human-readable analysis report
The agent uses tool-calling (e.g., OpenAI function calling) to orchestrate the analysis and produces a structured report including:
- Summary of detected kernels
- Identified issues (static analysis and profiling)
- Optimization suggestions
- Optional improved code
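As an illustration of the tool-calling step, the sketch below declares two tools in the OpenAI function-calling format and routes a tool call to a local Python function. The tool names, parameters, and stub handlers are hypothetical; the project's real schema lives in tool_calling_schema.json, which is not reproduced here. No API request is made.

```python
import json


def make_tool(name: str, description: str) -> dict:
    """Build a tool declaration in the OpenAI function-calling format."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {"kernel_path": {"type": "string"}},
                "required": ["kernel_path"],
            },
        },
    }


TOOLS = [
    make_tool("run_static_analysis", "Statically analyze a CUDA source file."),
    make_tool("run_profiler", "Profile a CUDA kernel (may return mock metrics)."),
]


# Stub handlers (hypothetical); a real agent would call the analyzer and profiler.
def run_static_analysis(kernel_path: str) -> dict:
    return {"kernel_path": kernel_path, "kernels": []}


def run_profiler(kernel_path: str) -> dict:
    return {"kernel_path": kernel_path, "mock": True}


HANDLERS = {"run_static_analysis": run_static_analysis, "run_profiler": run_profiler}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the requested tool; arguments arrive as a JSON string, results go back as one."""
    if name not in HANDLERS:
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps(HANDLERS[name](**json.loads(arguments_json)))


print(dispatch_tool_call("run_static_analysis", '{"kernel_path": "saxpy.cu"}'))
```

In a full agent loop, the string returned by `dispatch_tool_call` would be sent back to the model as a tool message so it can interpret the JSON and write the report.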
Requirements:
- Python 3.8 or higher
- CUDA Toolkit (optional, required only for profiling)
- Compatible NVIDIA GPU (optional, required only for profiling)
- OpenAI API key (required for LLM agent, except in mock mode)
Install the Python dependencies:

    pip install -r requirements.txt

To use the LLM agent, set your OpenAI API key:

    export OPENAI_API_KEY="your-api-key"

On Windows PowerShell:

    $env:OPENAI_API_KEY="your-api-key"

Basic analysis:

    python -m src.cli.main --kernel src/cuda/example_kernels/saxpy.cu

With runtime profiling:

    python -m src.cli.main --kernel src/cuda/example_kernels/saxpy.cu --profile

Save the report to a file (text or Markdown file name):

    python -m src.cli.main --kernel src/cuda/example_kernels/saxpy.cu --save-report report.txt
    python -m src.cli.main --kernel src/cuda/example_kernels/saxpy.cu --save-report report.md

Mock mode:

    python -m src.cli.main --kernel src/cuda/example_kernels/saxpy.cu --mock

Choose the model:

    python -m src.cli.main --kernel src/cuda/example_kernels/saxpy.cu --model gpt-4

Pass the API key on the command line:

    python -m src.cli.main --kernel src/cuda/example_kernels/saxpy.cu --api-key your-api-key

Here is an example of a simple CUDA kernel (SAXPY):

    #include <cuda_runtime.h>

    // SAXPY kernel: y = a * x + y
    // Single-precision A times X Plus Y
    __global__ void saxpy(float a, float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = a * x[i] + y[i];
        }
    }

This kernel performs a SAXPY (Single-precision A times X Plus Y) operation on vectors. The static analyzer will detect:
- Standard 1D indexing pattern
- Coalesced memory access (consecutive accesses)
- Simple bounds check (i < n) that does not cause significant divergence
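A quick calculation shows why the bounds check causes little divergence: only the warp that straddles index n has a mix of active and inactive threads. The sketch below emulates the 1D index computation on the host; `launch_stats` is a hypothetical helper, and the launch configuration (n = 1000, 256 threads per block) is an assumed example rather than something the tool prescribes.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def launch_stats(n: int, block_size: int) -> dict:
    """Count idle threads and warps split by the `i < n` bounds check."""
    blocks = (n + block_size - 1) // block_size  # usual ceil-division grid size
    warps_per_block = (block_size + WARP_SIZE - 1) // WARP_SIZE
    divergent = 0
    for b in range(blocks):
        for w in range(warps_per_block):
            # Global index range covered by this warp: i = b * block_size + thread index.
            first = b * block_size + w * WARP_SIZE
            lanes = min(WARP_SIZE, block_size - w * WARP_SIZE)
            last = first + lanes - 1
            if first < n <= last:  # some lanes pass the check, some fail it
                divergent += 1
    return {
        "blocks": blocks,
        "idle_threads": blocks * block_size - n,
        "total_warps": blocks * warps_per_block,
        "divergent_warps": divergent,
    }


stats = launch_stats(n=1000, block_size=256)
print(stats)
```

With these assumed values, 4 blocks are launched, 24 threads fail the check, and only 1 of the 32 warps is divergent. When n is a multiple of the warp size, no warp is split by the check at all.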
Limitations:
- Profiling requires a compatible NVIDIA GPU and CUDA Toolkit installed
- The LLM agent requires a valid OpenAI API key (or mock mode for testing)
- Static analysis is limited to common patterns and may not detect all performance issues
- The profiler may require separate compilation of C++ code
Run tests with pytest:

    pytest .

For tests with coverage:

    pytest --cov=src tests/

See the LICENSE file for details.