Kernel-Heim/forge-cute-py

forge-cute-py is a project for developing and evaluating CuTe DSL kernels in Python.

As initially planned, it should provide a workflow to run kernels, validate their correctness against PyTorch references, benchmark performance, and profile them.

Current scope (v0.1)

Target kernels aligned to KernelHeim Weeks 0–2:

  • Week 0: tiled copy / transpose
  • Week 1: reductions (sum) with multiple implementations (e.g., naive -> improved -> shuffle)
  • Week 2: single-pass online softmax
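For readers new to the Week 2 target, the idea behind "single-pass online softmax" can be sketched in plain Python. This is an illustrative CPU sketch of the general online-softmax recurrence, not the project's kernel (which is listed as a stub); the function name `online_softmax` is hypothetical and finite inputs are assumed. The running maximum and the running sum of exponentials are updated together in one pass over the data, and the sum is rescaled whenever the maximum grows.

```python
import math


def online_softmax(xs):
    """Softmax where max and normalizer are computed in a single pass.

    Hypothetical helper for illustration; assumes finite float inputs.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the old sum to the new max, then add the current term.
        running_sum = (
            running_sum * math.exp(running_max - new_max)
            + math.exp(x - new_max)
        )
        running_max = new_max
    # Final normalization reuses the statistics gathered above.
    return [math.exp(x - running_max) / running_sum for x in xs]


if __name__ == "__main__":
    print(online_softmax([1.0, 2.0, 3.0]))
```

A two-pass softmax would first scan for the maximum and then scan again for the sum; the recurrence above merges both scans, which is what makes the approach attractive on a GPU where each pass over the input costs memory bandwidth.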

Not currently in scope for v0: FlashAttention kernels (FA1+), decode/KV-cache, FP8, distributed/NCCL, C++ extension builds.


Requirements

  • Linux + NVIDIA GPU (CUDA-capable)
  • Python (managed via uv)
  • PyTorch installed with CUDA support
  • Recommended tooling for profiling:
    • Nsight Compute (ncu)
    • Nsight Systems (nsys)
    • compute-sanitizer

Install (uv)

uv sync

If you need an editable/dev install, use your normal uv workflow (the project is expected to be runnable via uv run ...).


Sanity check

uv run python -m forge_cute_py.env_check

This should validate CUDA/PyTorch visibility and basic runtime assumptions.


Correctness tests (PyTorch reference-gated)

uv run pytest -q

Correctness is the primary gate for changes: kernels must match reference behavior within defined tolerances.
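As a rough illustration of what "within defined tolerances" usually means, the sketch below applies the common criterion |actual - expected| <= atol + rtol * |expected| element by element, which is the same style of check that torch.testing.assert_close performs. The helper name `check_against_reference` and the tolerance values are assumptions for illustration only; the project's actual tolerances are defined in its tests.

```python
def check_against_reference(actual, expected, rtol=1e-3, atol=1e-5):
    """Raise AssertionError if any element falls outside the tolerance band.

    Hypothetical helper; rtol/atol defaults are placeholders, not the
    project's real settings.
    """
    if len(actual) != len(expected):
        raise AssertionError(
            f"length mismatch: {len(actual)} vs {len(expected)}"
        )
    for i, (a, e) in enumerate(zip(actual, expected)):
        allowed = atol + rtol * abs(e)
        if abs(a - e) > allowed:
            raise AssertionError(
                f"index {i}: |{a} - {e}| exceeds allowed error {allowed}"
            )


if __name__ == "__main__":
    # Passes: the error is well inside the relative tolerance.
    check_against_reference([1.0004, 2.0], [1.0, 2.0])
```

Combining an absolute and a relative term keeps the check meaningful both near zero, where relative error blows up, and for large magnitudes, where a fixed absolute bound would be too strict for float16 outputs.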


User guide (quickstart)

Run a single op in Python:

uv run python - <<'PY'
import torch
from forge_cute_py.ops import copy_transpose

x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
y = copy_transpose(x, tile_size=16)
print(y.shape)  # torch.Size([1024, 1024])
PY
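To make the tile_size argument concrete, here is a CPU-only sketch of a tiled transpose on nested lists. It is an assumption-laden illustration of the general technique, not the CuTe DSL kernel: the function `tiled_transpose` is hypothetical, and the staged `tile` list merely stands in for the shared-memory buffer a GPU kernel would use.

```python
def tiled_transpose(matrix, tile_size=16):
    """Transpose a rectangular list-of-lists one tile at a time.

    Hypothetical helper for illustration only.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    out = [[None] * rows for _ in range(cols)]
    for r0 in range(0, rows, tile_size):
        for c0 in range(0, cols, tile_size):
            # Stage one tile (stand-in for a shared-memory buffer).
            tile = [
                matrix[r][c0:c0 + tile_size]
                for r in range(r0, min(r0 + tile_size, rows))
            ]
            # Write the tile back with row and column indices swapped.
            for i, tile_row in enumerate(tile):
                for j, value in enumerate(tile_row):
                    out[c0 + j][r0 + i] = value
    return out


if __name__ == "__main__":
    print(tiled_transpose([[1, 2, 3], [4, 5, 6]], tile_size=2))
    # [[1, 4], [2, 5], [3, 6]]
```

On a GPU, staging through a tile lets both the global-memory read and the global-memory write proceed along contiguous rows, with the index swap happening inside fast on-chip memory.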

Ops are also accessible via torch.ops.forge_cute_py._op_name() (note the underscore prefix) for direct custom op access.

Run a smoke benchmark suite (JSON output):

uv run python bench/run.py --suite smoke --out results.json

Run a standalone benchmark:

uv run python bench/benchmark_copy_transpose.py --tile-size 16

Profile a kernel with the helper script:

./scripts/profile.sh ncu -- uv run python bench/benchmark_copy_transpose.py
./scripts/profile.sh nsys -- uv run pytest tests/test_copy_transpose.py -k "tile_size=16"

Or use profiling tools directly:

ncu --set full -o profiles/copy_transpose uv run python bench/benchmark_copy_transpose.py

Kernel status (v0.1)

Op             | Status      | Variants            | Notes
-------------- | ----------- | ------------------- | -----
copy_transpose | Implemented | tile_size=16/32     | CuTe DSL kernel with tiled shared memory
reduce_sum     | Stub (ref)  | naive/improved/shfl | Uses PyTorch reference; kernel to be implemented
softmax_online | Stub (ref)  | single-pass         | Uses PyTorch reference with autograd support; kernel to be implemented
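Since the shfl variant of reduce_sum is still a stub, the following pure-Python simulation may help show what a warp-shuffle reduction does. It is a sketch under stated assumptions: a 32-lane warp, a hypothetical function name `warp_reduce_sum`, and a list standing in for per-lane registers. Each step mimics a shuffle-down exchange in which lane i adds the value held by lane i + offset, and the offset halves until lane 0 holds the total.

```python
WARP_SIZE = 32  # assumed warp width for this illustration


def warp_reduce_sum(lanes):
    """Simulate a shuffle-down tree reduction; returns lane 0's result.

    Hypothetical helper; `lanes` must hold exactly WARP_SIZE values.
    """
    vals = list(lanes)
    if len(vals) != WARP_SIZE:
        raise ValueError(f"expected {WARP_SIZE} lane values")
    offset = WARP_SIZE // 2
    while offset > 0:
        # Lane i reads from lane i + offset; a lane whose source is out of
        # range reads its own value, so upper lanes end up holding garbage.
        vals = [
            vals[i] + (vals[i + offset] if i + offset < WARP_SIZE else vals[i])
            for i in range(WARP_SIZE)
        ]
        offset //= 2
    return vals[0]


if __name__ == "__main__":
    print(warp_reduce_sum(range(32)))  # 496
```

The reduction finishes in five exchange steps (offsets 16, 8, 4, 2, 1) instead of 31 sequential additions, and on real hardware the exchanges happen between registers without going through shared memory.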

Package layout (high level)

  • forge_cute_py/ops/ Python-facing op wrappers, input validation, optional torch.library registration.
  • forge_cute_py/kernels/ CuTe DSL kernel implementations (organized by week/op).
  • forge_cute_py/ref/ Reference implementations (PyTorch) used by tests and validation.
  • tests/ Environment checks + correctness tests (pytest).
  • bench/ Benchmark CLI, suites, and JSON reporting.
  • scripts/ Profiling and sanitizer runners (ncu, nsys, compute-sanitizer).

Quick Reference

Setup and Validation

uv sync                                    # Install dependencies
uv run python -m forge_cute_py.env_check  # Verify CUDA/PyTorch setup

Testing

uv run pytest -q                                      # Run all tests
uv run pytest tests/test_copy_transpose.py            # Run specific test file
uv run pytest -k "float16 and tile_size=16"           # Run filtered tests
uv run pre-commit run --all-files                     # Run linting/formatting

Benchmarking

uv run python bench/run.py --suite smoke              # Run benchmark suite
uv run python bench/run.py --suite smoke --out out.json  # Save results
uv run python bench/benchmark_copy_transpose.py       # Standalone benchmark

Profiling

./scripts/profile.sh ncu -- uv run python bench/benchmark_copy_transpose.py  # Nsight Compute
./scripts/profile.sh nsys -- uv run pytest tests/test_copy_transpose.py      # Nsight Systems
./scripts/profile.sh sanitizer -- uv run python -m forge_cute_py.env_check   # Memory check

Contributing

Note: We are not accepting unsolicited pull requests during v0 stabilization. Please open an issue first and wait for maintainer approval before starting work.

See CONTRIBUTING.md for full details.

For kernel development workflow and architecture details, see:


Roadmap (v0.1 completion)

  • Week 0 copy/transpose: end-to-end correctness + benchmark + profile scripts
  • Week 1 reductions: multiple variants, correctness + benchmark coverage
  • Week 2 online softmax: correctness + benchmark coverage + profiling notes
  • CI: run correctness on supported GPU runners; optional perf smoke checks

See ROADMAP.md for detailed breakdown and progress tracking.
