feat: add --profile flag for LLM-readable CUDA kernel summary#279

Open
Bortlesboat wants to merge 1 commit into karpathy:master from Bortlesboat:feat/llm-readable-profiler
Conversation

@Bortlesboat
Problem

Addresses #118. The AI agent has no visibility into which CUDA kernels are the actual hardware bottlenecks. It can only infer from step time and MFU%, which are too coarse to guide targeted optimization (e.g., "is attention or MLP the bottleneck?").

Solution

Add a --profile flag to train.py that:

  1. Runs torch.profiler over a small number of steps (2 warmup + 5 active)
  2. Prints a Markdown table of the top CUDA kernels sorted by self-time
  3. Exits cleanly — does not consume any of TIME_BUDGET

Usage:

uv run train.py --profile

Example output:

## Top 15 CUDA kernels (5 profiled steps)

| Rank | Kernel | Self CUDA (ms) | % of Total | Calls |
|------|--------|---------------|------------|-------|
| 1 | `flash_attn_varlen_fwd` | 1842.31 | 38.2% | 35 |
| 2 | `volta_sgemm_128x64_nn` | 1204.17 | 24.9% | 70 |
| 3 | `elementwise_kernel` | 412.88 | 8.6% | 140 |
...

_Total CUDA time: 4.823s over 5 steps_

This lets the agent immediately see, for example, that FlashAttention dominates at 38% — and decide whether to explore window size changes vs. MLP width changes.

Changes

  • train.py only (the single file agents are allowed to modify)
  • +sys, +argparse imports (~2 lines)
  • argparse.parse_known_args() block (~5 lines, add_help=False so it doesn't conflict with uv run)
  • run_profiler() function (~35 lines)
  • Guard before training loop: if PROFILE_MODE: run_profiler(...); sys.exit(0) (~4 lines)
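The flag-parsing piece described above can be sketched roughly as follows. This is a guess at the shape of the ~5-line block, not the actual diff; the `parse_profile_flag` helper and `PROFILE_MODE` name are illustrative (the PR mentions `PROFILE_MODE` as the guard variable).

```python
import argparse
import sys

def parse_profile_flag(argv=None):
    # add_help=False avoids defining -h/--help, per the PR's note about
    # not conflicting with the uv run launcher; parse_known_args silently
    # ignores any other arguments the training script may receive.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--profile", action="store_true",
                        help="print a CUDA kernel summary and exit")
    args, _unknown = parser.parse_known_args(argv)
    return args.profile

PROFILE_MODE = parse_profile_flag(sys.argv[1:])
```

Because unknown arguments are tolerated, a plain `uv run train.py` is unaffected and `--profile` can coexist with any other flags.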

No changes to prepare.py, program.md, or any other file. Normal training runs are completely unaffected.
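For concreteness, the `run_profiler()` body might look like the sketch below. The collection half is an assumption reconstructed from this description (`run_profiler`, `model`, and `get_batch` are placeholders; the `schedule`/`key_averages`/`self_cuda_time_total` names are real `torch.profiler` APIs), and the Markdown formatting is factored into a pure helper.

```python
def format_kernel_table(rows, total_ms, steps, top_k=15):
    """rows: list of (kernel_name, self_cuda_ms, calls), sorted descending by time."""
    lines = [f"## Top {top_k} CUDA kernels ({steps} profiled steps)", "",
             "| Rank | Kernel | Self CUDA (ms) | % of Total | Calls |",
             "|------|--------|---------------|------------|-------|"]
    for rank, (name, ms, calls) in enumerate(rows[:top_k], start=1):
        pct = 100.0 * ms / total_ms if total_ms else 0.0
        lines.append(f"| {rank} | `{name}` | {ms:.2f} | {pct:.1f}% | {calls} |")
    lines += ["", f"_Total CUDA time: {total_ms / 1000:.3f}s over {steps} steps_"]
    return "\n".join(lines)

def run_profiler(model, get_batch, warmup=2, active=5):
    # Collection sketch: profile warmup + active steps, then summarize.
    from torch.profiler import profile, schedule, ProfilerActivity
    with profile(activities=[ProfilerActivity.CUDA],
                 schedule=schedule(wait=0, warmup=warmup, active=active)) as prof:
        for _ in range(warmup + active):
            loss = model(*get_batch())  # placeholder for one training step
            loss.backward()
            prof.step()
    # self_cuda_time_total is reported in microseconds; convert to ms.
    rows = sorted(((e.key, e.self_cuda_time_total / 1000.0, e.count)
                   for e in prof.key_averages()),
                  key=lambda r: r[1], reverse=True)
    total_ms = sum(r[1] for r in rows)
    print(format_kernel_table(rows, total_ms, active))
```

Keeping the table formatting separate from collection makes the output deterministic and easy to check without a GPU.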

Add a --profile CLI flag to train.py that runs torch.profiler over a few
warmup steps and prints a Markdown table of the top CUDA kernels by self-time.
This lets an AI agent identify hardware bottlenecks (FlashAttention vs MLP vs
other kernels) without needing trace visualization tools.

Usage: uv run train.py --profile
Exits after printing the table without consuming the TIME_BUDGET.

Closes karpathy#118
IgorTavcar added a commit to IgorTavcar/autoresearch that referenced this pull request Mar 17, 2026
…iler, UCB1 search

PR karpathy#287 — Gradient clipping before optimizer step (baseline/train.py)
  Adds GRAD_CLIP_NORM hyperparameter (default 1.0, set 0.0 to disable).
  relu² activations can produce gradient spikes that silently degrade
  weights. The existing loss > 100 fast-fail only catches damage after
  it has already happened. Clipping prevents wasted experiment runs.

PR karpathy#279 — --profile flag for LLM-readable CUDA kernel summary (baseline/train.py)
  Adds argparse --profile flag that runs torch.profiler over a few warmup
  steps and prints a Markdown table of top CUDA kernels by self-time,
  then exits. Lets the agent identify hardware bottlenecks (attention vs
  MLP vs elementwise) without needing trace visualization tools.
  Usage: uv run baseline/train.py --profile

Issue karpathy#284 — DUSE alt program (baseline/program-alt.md)
  Alternative program.md integrating Dimensional UCB1 Search + Experiment
  Memory from issue karpathy#284. Adds: 7-dimension map, experiments.json structured
  memory, UCB1 dimension selector (exploration vs exploitation), 90-second
  early abort gate, rescue pool for recombining discarded sub-mechanisms.
  Pure prompt change, no code modifications required.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>