forge-cute-py is a project for developing and evaluating CuTe DSL kernels in Python.
As initially planned, it provides a workflow to run kernels, validate correctness against PyTorch references, benchmark performance, and profile.
Target kernels are aligned with KernelHeim Weeks 0–2:
- Week 0: tiled copy / transpose
- Week 1: reductions (sum) with multiple implementations (e.g., naive -> improved -> shuffle)
- Week 2: single-pass online softmax
Not currently in scope for v0: FlashAttention kernels (FA1+), decode/KV-cache, FP8, distributed/NCCL, C++ extension builds.
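The core idea of the Week 2 kernel, single-pass (online) softmax, can be sketched in plain Python: one sweep keeps a running maximum and a running sum of exponentials, rescaling the sum whenever the maximum grows. This is a sketch of the algorithm only, not the CuTe DSL kernel:

```python
import math

def online_softmax(xs):
    """Single-pass online softmax: one sweep over the data, keeping a
    running max and a running sum of exponentials. When the max grows,
    the accumulated sum is rescaled to stay consistent with the new max."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the accumulated sum to the new max before adding this term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in xs]

# Matches the naive two-pass definition on a small input.
xs = [1.0, 2.0, 3.0]
m = max(xs)
naive = [math.exp(x - m) for x in xs]
total = sum(naive)
naive = [v / total for v in naive]
assert all(abs(a - b) < 1e-12 for a, b in zip(online_softmax(xs), naive))
```

The rescaling step is what makes a single pass possible: it is the same trick FlashAttention-style kernels use, here in its simplest form.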
- Linux + NVIDIA GPU (CUDA-capable)
- Python (managed via `uv`)
- PyTorch installed with CUDA support
- Recommended tooling for profiling:
  - Nsight Compute (`ncu`)
  - Nsight Systems (`nsys`)
  - compute-sanitizer

```bash
uv sync
```

If you need an editable/dev install, use your normal uv workflow (the project is expected to be runnable via `uv run ...`).
```bash
uv run python -m forge_cute_py.env_check
```

This should validate CUDA/PyTorch visibility and basic runtime assumptions.
```bash
uv run pytest -q
```

Correctness is the primary gate for changes: kernels must match reference behavior within defined tolerances.
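"Within defined tolerances" usually means an elementwise check of the form `|actual - expected| <= atol + rtol * |expected|`, the same semantics as `torch.testing.assert_close`. A pure-Python sketch of that check (tolerance values here are illustrative; fp16 kernels typically need looser tolerances than fp32):

```python
def within_tolerance(actual, expected, rtol=1e-3, atol=1e-5):
    """Elementwise closeness check mirroring torch.testing.assert_close
    semantics: |a - e| <= atol + rtol * |e|. The rtol/atol defaults here
    are illustrative, not the project's actual tolerances."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))

assert within_tolerance([1.0001, 2.0], [1.0, 2.0])   # small relative error passes
assert not within_tolerance([1.1, 2.0], [1.0, 2.0])  # 10% error fails
```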
Run a single op in Python:
```bash
uv run python - <<'PY'
import torch
from forge_cute_py.ops import copy_transpose

x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
y = copy_transpose(x, tile_size=16)
print(y.shape)  # torch.Size([1024, 1024])
PY
```

Ops are also accessible via `torch.ops.forge_cute_py._op_name()` (note the underscore prefix) for direct custom op access.
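The tiling idea behind `copy_transpose` can be sketched in plain Python: the matrix is processed in `tile_size x tile_size` blocks, which is what lets the GPU kernel stage each tile through shared memory for coalesced reads and writes. This is an illustration of the access pattern only; the real kernel is written in the CuTe DSL:

```python
def tiled_transpose(x, tile_size=16):
    """Transpose a list-of-lists matrix tile by tile. On the GPU, each
    tile would be staged through shared memory; here the loop nest just
    illustrates the blocked access pattern."""
    rows, cols = len(x), len(x[0])
    out = [[0] * rows for _ in range(cols)]
    for i0 in range(0, rows, tile_size):          # iterate over tile rows
        for j0 in range(0, cols, tile_size):      # iterate over tile cols
            for i in range(i0, min(i0 + tile_size, rows)):
                for j in range(j0, min(j0 + tile_size, cols)):
                    out[j][i] = x[i][j]
    return out

x = [[r * 4 + c for c in range(4)] for r in range(3)]  # 3x4 matrix
assert tiled_transpose(x, tile_size=2) == [list(col) for col in zip(*x)]
```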
Run a smoke benchmark suite (JSON output):
```bash
uv run python bench/run.py --suite smoke --out results.json
```

Run a standalone benchmark:

```bash
uv run python bench/benchmark_copy_transpose.py --tile-size 16
```

Profile a kernel with the helper script:

```bash
./scripts/profile.sh ncu -- uv run python bench/benchmark_copy_transpose.py
./scripts/profile.sh nsys -- uv run pytest tests/test_copy_transpose.py -k "tile_size=16"
```

Or use the profiling tools directly:

```bash
ncu --set full -o profiles/copy_transpose uv run python bench/benchmark_copy_transpose.py
```

| Op | Status | Variants | Notes |
|---|---|---|---|
| copy_transpose | Implemented | tile_size=16/32 | CuTe DSL kernel with tiled shared memory |
| reduce_sum | Stub (ref) | naive/improved/shfl | Uses PyTorch reference; kernel to be implemented |
| softmax_online | Stub (ref) | single-pass | Uses PyTorch reference with autograd support; kernel to be implemented |
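The `reduce_sum` variants in the table differ in how partial sums are combined. The `shfl` variant mirrors the classic tree reduction, halving the number of active elements each step so the work finishes in O(log n) rounds instead of n sequential additions. A pure-Python sketch of the two patterns (illustrative only; the tree version assumes a power-of-two length):

```python
def naive_sum(xs):
    # Sequential accumulation: n dependent additions, one after another.
    acc = 0.0
    for x in xs:
        acc += x
    return acc

def tree_sum(xs):
    """Pairwise/tree reduction: halve the array each round, the same
    combining pattern a warp-shuffle (shfl) reduction performs in
    registers. Assumes len(xs) is a power of two."""
    xs = list(xs)
    while len(xs) > 1:
        half = len(xs) // 2
        xs = [xs[i] + xs[i + half] for i in range(half)]
    return xs[0]

data = [float(i) for i in range(8)]
assert naive_sum(data) == tree_sum(data) == 28.0
```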
| Path | Purpose |
|---|---|
| `forge_cute_py/ops/` | Python-facing op wrappers, input validation, optional `torch.library` registration |
| `forge_cute_py/kernels/` | CuTe DSL kernel implementations (organized by week/op) |
| `forge_cute_py/ref/` | Reference implementations (PyTorch) used by tests and validation |
| `tests/` | Environment checks + correctness tests (pytest) |
| `bench/` | Benchmark CLI, suites, and JSON reporting |
| `scripts/` | Profiling and sanitizer runners (`ncu`, `nsys`, compute-sanitizer) |
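The benchmark scripts in `bench/` report results as JSON. A minimal sketch of the warmup-then-time pattern such a script might follow (the `benchmark` helper is hypothetical, and pure wall-clock timing is used for illustration; a real GPU benchmark should synchronize the device and prefer CUDA events):

```python
import json
import time

def benchmark(fn, *args, warmup=3, iters=10):
    """Time fn(*args): run warmup iterations first, then report the mean
    of the timed iterations in milliseconds. Wall-clock only; a GPU
    benchmark would call torch.cuda.synchronize() around the timed region."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    elapsed_ms = (time.perf_counter() - start) / iters * 1e3
    return {"name": fn.__name__, "iters": iters, "mean_ms": elapsed_ms}

result = benchmark(sorted, list(range(10_000)))
print(json.dumps(result))
```

The warmup runs matter on GPU: they absorb one-time costs (JIT compilation, allocator warmup) that would otherwise inflate the first timed iteration.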
```bash
uv sync                                                   # Install dependencies
uv run python -m forge_cute_py.env_check                  # Verify CUDA/PyTorch setup
```

```bash
uv run pytest -q                                          # Run all tests
uv run pytest tests/test_copy_transpose.py                # Run specific test file
uv run pytest -k "float16 and tile_size=16"               # Run filtered tests
uv run pre-commit run --all-files                         # Run linting/formatting
```

```bash
uv run python bench/run.py --suite smoke                  # Run benchmark suite
uv run python bench/run.py --suite smoke --out out.json   # Save results
uv run python bench/benchmark_copy_transpose.py           # Standalone benchmark
```

```bash
./scripts/profile.sh ncu -- uv run python bench/benchmark_copy_transpose.py     # Nsight Compute
./scripts/profile.sh nsys -- uv run pytest tests/test_copy_transpose.py         # Nsight Systems
./scripts/profile.sh sanitizer -- uv run python -m forge_cute_py.env_check      # Memory check
```

Note: we are not accepting unsolicited pull requests during v0 stabilization. Please open an issue first and wait for maintainer approval before starting work.
See CONTRIBUTING.md for full details.
For kernel development workflow and architecture details, see:
- DEVELOPMENT.md - Kernel development guide
- Week 0 copy/transpose: end-to-end correctness + benchmark + profile scripts
- Week 1 reductions: multiple variants, correctness + benchmark coverage
- Week 2 online softmax: correctness + benchmark coverage + profiling notes
- CI: run correctness on supported GPU runners; optional perf smoke checks
See ROADMAP.md for detailed breakdown and progress tracking.