forge-cute-py is a project for developing and evaluating CuTe DSL kernels in Python.
As initially planned, it provides a workflow to run kernels, validate correctness against PyTorch references, benchmark performance, and profile.
Target kernels are aligned with KernelHeim Weeks 0–2:
- Week 0: tiled copy / transpose
- Week 1: reductions (sum) with multiple implementations (e.g., naive -> improved -> shuffle)
- Week 2: single-pass online softmax
Not currently in scope for v0: FlashAttention kernels (FA1+), decode/KV-cache, FP8, distributed/NCCL, C++ extension builds.
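The core idea of the Week 2 kernel, single-pass (online) softmax, can be sketched in plain Python: one sweep keeps a running maximum and a running sum of exponentials, rescaling the sum whenever the maximum grows. This is a sketch of the algorithm only, not the CuTe DSL kernel:

```python
import math

def online_softmax(xs):
    """Single-pass online softmax: one sweep over the data, keeping a
    running max and a running sum of exponentials. When the max grows,
    the accumulated sum is rescaled to stay consistent with the new max."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the accumulated sum to the new max before adding this term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in xs]

# Matches the naive two-pass definition on a small input.
xs = [1.0, 2.0, 3.0]
m = max(xs)
naive = [math.exp(x - m) for x in xs]
total = sum(naive)
naive = [v / total for v in naive]
assert all(abs(a - b) < 1e-12 for a, b in zip(online_softmax(xs), naive))
```

The rescaling step is what makes a single pass possible: it is the same trick FlashAttention-style kernels use, here in its simplest form.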
- Linux + NVIDIA GPU (CUDA-capable)
- Python (managed via `uv`)
- PyTorch installed with CUDA support
- Recommended tooling for profiling:
  - Nsight Compute (`ncu`)
  - Nsight Systems (`nsys`)
  - compute-sanitizer

```bash
uv sync
```

If you need an editable/dev install, use your normal uv workflow (the project is expected to be runnable via `uv run ...`).
```bash
uv run python -m forge_cute_py.env_check
```

This should validate CUDA/PyTorch visibility and basic runtime assumptions.
```bash
uv run pytest -q
```

Correctness is the primary gate for changes: kernels must match reference behavior within defined tolerances.
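"Within defined tolerances" usually means an elementwise check of the form `|actual - expected| <= atol + rtol * |expected|`, the same semantics as `torch.testing.assert_close`. A pure-Python sketch of that check (tolerance values here are illustrative; fp16 kernels typically need looser tolerances than fp32):

```python
def within_tolerance(actual, expected, rtol=1e-3, atol=1e-5):
    """Elementwise closeness check mirroring torch.testing.assert_close
    semantics: |a - e| <= atol + rtol * |e|. The rtol/atol defaults here
    are illustrative, not the project's actual tolerances."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))

assert within_tolerance([1.0001, 2.0], [1.0, 2.0])   # small relative error passes
assert not within_tolerance([1.1, 2.0], [1.0, 2.0])  # 10% error fails
```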
Run a single op in Python:
```bash
uv run python - <<'PY'
import torch
from forge_cute_py.ops import copy_transpose

x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
y = copy_transpose(x, tile_size=16)
print(y.shape)  # torch.Size([1024, 1024])
PY
```

Ops are also accessible via `torch.ops.forge_cute_py._op_name()` (note the underscore prefix) for direct custom op access.
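The tiling idea behind `copy_transpose` can be sketched in plain Python: the matrix is processed in `tile_size x tile_size` blocks, which is what lets the GPU kernel stage each tile through shared memory for coalesced reads and writes. This is an illustration of the access pattern only; the real kernel is written in the CuTe DSL:

```python
def tiled_transpose(x, tile_size=16):
    """Transpose a list-of-lists matrix tile by tile. On the GPU, each
    tile would be staged through shared memory; here the loop nest just
    illustrates the blocked access pattern."""
    rows, cols = len(x), len(x[0])
    out = [[0] * rows for _ in range(cols)]
    for i0 in range(0, rows, tile_size):          # iterate over tile rows
        for j0 in range(0, cols, tile_size):      # iterate over tile cols
            for i in range(i0, min(i0 + tile_size, rows)):
                for j in range(j0, min(j0 + tile_size, cols)):
                    out[j][i] = x[i][j]
    return out

x = [[r * 4 + c for c in range(4)] for r in range(3)]  # 3x4 matrix
assert tiled_transpose(x, tile_size=2) == [list(col) for col in zip(*x)]
```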
Run a smoke benchmark suite (JSON output):
```bash
uv run python bench/run.py --suite smoke --out results.json
```

Run a standalone benchmark:

```bash
uv run python bench/benchmark_copy_transpose.py --tile-size 16
```

Profile a kernel with the helper script:

```bash
./scripts/profile.sh ncu -- uv run python bench/benchmark_copy_transpose.py
./scripts/profile.sh nsys -- uv run pytest tests/test_copy_transpose.py -k "tile_size=16"
```

Or use the profiling tools directly:

```bash
ncu --set full -o profiles/copy_transpose uv run python bench/benchmark_copy_transpose.py
```

| Op | Status | Variants | Notes |
|---|---|---|---|
| copy_transpose | Implemented | tile_size=16/32 | CuTe DSL kernel with tiled shared memory |
| reduce_sum | Stub (ref) | naive/improved/shfl | Uses PyTorch reference; kernel to be implemented |
| softmax_online | Stub (ref) | single-pass | Uses PyTorch reference with autograd support; kernel to be implemented |
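The `reduce_sum` variants in the table differ in how partial sums are combined. The `shfl` variant mirrors the classic tree reduction, halving the number of active elements each step so the work finishes in O(log n) rounds instead of n sequential additions. A pure-Python sketch of the two patterns (illustrative only; the tree version assumes a power-of-two length):

```python
def naive_sum(xs):
    # Sequential accumulation: n dependent additions, one after another.
    acc = 0.0
    for x in xs:
        acc += x
    return acc

def tree_sum(xs):
    """Pairwise/tree reduction: halve the array each round, the same
    combining pattern a warp-shuffle (shfl) reduction performs in
    registers. Assumes len(xs) is a power of two."""
    xs = list(xs)
    while len(xs) > 1:
        half = len(xs) // 2
        xs = [xs[i] + xs[i + half] for i in range(half)]
    return xs[0]

data = [float(i) for i in range(8)]
assert naive_sum(data) == tree_sum(data) == 28.0
```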
| Path | Purpose |
|---|---|
| `forge_cute_py/ops/` | Python-facing op wrappers, input validation, optional `torch.library` registration |
| `forge_cute_py/kernels/` | CuTe DSL kernel implementations (organized by week/op) |
| `forge_cute_py/ref/` | Reference implementations (PyTorch) used by tests and validation |
| `tests/` | Environment checks + correctness tests (pytest) |
| `bench/` | Benchmark CLI, suites, and JSON reporting |
| `scripts/` | Profiling and sanitizer runners (`ncu`, `nsys`, compute-sanitizer) |
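The benchmark scripts in `bench/` report results as JSON. A minimal sketch of the warmup-then-time pattern such a script might follow (the `benchmark` helper is hypothetical, and pure wall-clock timing is used for illustration; a real GPU benchmark should synchronize the device and prefer CUDA events):

```python
import json
import time

def benchmark(fn, *args, warmup=3, iters=10):
    """Time fn(*args): run warmup iterations first, then report the mean
    of the timed iterations in milliseconds. Wall-clock only; a GPU
    benchmark would call torch.cuda.synchronize() around the timed region."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    elapsed_ms = (time.perf_counter() - start) / iters * 1e3
    return {"name": fn.__name__, "iters": iters, "mean_ms": elapsed_ms}

result = benchmark(sorted, list(range(10_000)))
print(json.dumps(result))
```

The warmup runs matter on GPU: they absorb one-time costs (JIT compilation, allocator warmup) that would otherwise inflate the first timed iteration.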
```bash
uv sync                                                   # Install dependencies
uv run python -m forge_cute_py.env_check                  # Verify CUDA/PyTorch setup
```

```bash
uv run pytest -q                                          # Run all tests
uv run pytest tests/test_copy_transpose.py                # Run specific test file
uv run pytest -k "float16 and tile_size=16"               # Run filtered tests
uv run pre-commit run --all-files                         # Run linting/formatting
```

```bash
uv run python bench/run.py --suite smoke                  # Run benchmark suite
uv run python bench/run.py --suite smoke --out out.json   # Save results
uv run python bench/benchmark_copy_transpose.py           # Standalone benchmark
```

```bash
./scripts/profile.sh ncu -- uv run python bench/benchmark_copy_transpose.py     # Nsight Compute
./scripts/profile.sh nsys -- uv run pytest tests/test_copy_transpose.py         # Nsight Systems
./scripts/profile.sh sanitizer -- uv run python -m forge_cute_py.env_check      # Memory check
```

Note: we are not accepting unsolicited pull requests during v0 stabilization. Please open an issue first and wait for maintainer approval before starting work.
See CONTRIBUTING.md for full details.
For kernel development workflow and architecture details, see:
- DEVELOPMENT.md - Kernel development guide
- Week 0 copy/transpose: end-to-end correctness + benchmark + profile scripts
- Week 1 reductions: multiple variants, correctness + benchmark coverage
- Week 2 online softmax: correctness + benchmark coverage + profiling notes
- CI: run correctness on supported GPU runners; optional perf smoke checks
See ROADMAP.md for detailed breakdown and progress tracking.