feat: add --profile flag for LLM-readable CUDA kernel summary#279

Open
Bortlesboat wants to merge 1 commit into karpathy:master from Bortlesboat:feat/llm-readable-profiler
Conversation

@Bortlesboat
Problem

Addresses #118. The AI agent has no visibility into which CUDA kernels are the actual hardware bottlenecks. It can only infer from step time and MFU%, which are too coarse to guide targeted optimization (e.g., "is attention or MLP the bottleneck?").

Solution

Add a --profile flag to train.py that:

  1. Runs torch.profiler over a small number of steps (2 warmup + 5 active)
  2. Prints a Markdown table of the top CUDA kernels sorted by self-time
  3. Exits cleanly — does not consume any of TIME_BUDGET

Usage:

uv run train.py --profile

Example output:

## Top 15 CUDA kernels (5 profiled steps)

| Rank | Kernel | Self CUDA (ms) | % of Total | Calls |
|------|--------|---------------|------------|-------|
| 1 | `flash_attn_varlen_fwd` | 1842.31 | 38.2% | 35 |
| 2 | `volta_sgemm_128x64_nn` | 1204.17 | 24.9% | 70 |
| 3 | `elementwise_kernel` | 412.88 | 8.6% | 140 |
...

_Total CUDA time: 4.823s over 5 steps_

This lets the agent immediately see, for example, that FlashAttention dominates at 38% — and decide whether to explore window size changes vs. MLP width changes.

Changes

  • train.py only (the single file agents are allowed to modify)
  • +sys, +argparse imports (~2 lines)
  • argparse.parse_known_args() block (~5 lines, add_help=False so it doesn't conflict with uv run)
  • run_profiler() function (~35 lines)
  • Guard before training loop: if PROFILE_MODE: run_profiler(...); sys.exit(0) (~4 lines)
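The flag-parsing piece described above can be sketched roughly as follows. This is a guess at the shape of the ~5-line block, not the actual diff; the `parse_profile_flag` helper and `PROFILE_MODE` name are illustrative (the PR mentions `PROFILE_MODE` as the guard variable).

```python
import argparse
import sys

def parse_profile_flag(argv=None):
    # add_help=False avoids defining -h/--help, per the PR's note about
    # not conflicting with the uv run launcher; parse_known_args silently
    # ignores any other arguments the training script may receive.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--profile", action="store_true",
                        help="print a CUDA kernel summary and exit")
    args, _unknown = parser.parse_known_args(argv)
    return args.profile

PROFILE_MODE = parse_profile_flag(sys.argv[1:])
```

Because unknown arguments are tolerated, a plain `uv run train.py` is unaffected and `--profile` can coexist with any other flags.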

No changes to prepare.py, program.md, or any other file. Normal training runs are completely unaffected.
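For concreteness, the `run_profiler()` body might look like the sketch below. The collection half is an assumption reconstructed from this description (`run_profiler`, `model`, and `get_batch` are placeholders; the `schedule`/`key_averages`/`self_cuda_time_total` names are real `torch.profiler` APIs), and the Markdown formatting is factored into a pure helper.

```python
def format_kernel_table(rows, total_ms, steps, top_k=15):
    """rows: list of (kernel_name, self_cuda_ms, calls), sorted descending by time."""
    lines = [f"## Top {top_k} CUDA kernels ({steps} profiled steps)", "",
             "| Rank | Kernel | Self CUDA (ms) | % of Total | Calls |",
             "|------|--------|---------------|------------|-------|"]
    for rank, (name, ms, calls) in enumerate(rows[:top_k], start=1):
        pct = 100.0 * ms / total_ms if total_ms else 0.0
        lines.append(f"| {rank} | `{name}` | {ms:.2f} | {pct:.1f}% | {calls} |")
    lines += ["", f"_Total CUDA time: {total_ms / 1000:.3f}s over {steps} steps_"]
    return "\n".join(lines)

def run_profiler(model, get_batch, warmup=2, active=5):
    # Collection sketch: profile warmup + active steps, then summarize.
    from torch.profiler import profile, schedule, ProfilerActivity
    with profile(activities=[ProfilerActivity.CUDA],
                 schedule=schedule(wait=0, warmup=warmup, active=active)) as prof:
        for _ in range(warmup + active):
            loss = model(*get_batch())  # placeholder for one training step
            loss.backward()
            prof.step()
    # self_cuda_time_total is reported in microseconds; convert to ms.
    rows = sorted(((e.key, e.self_cuda_time_total / 1000.0, e.count)
                   for e in prof.key_averages()),
                  key=lambda r: r[1], reverse=True)
    total_ms = sum(r[1] for r in rows)
    print(format_kernel_table(rows, total_ms, active))
```

Keeping the table formatting separate from collection makes the output deterministic and easy to check without a GPU.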

Add a --profile CLI flag to train.py that runs torch.profiler over a few
warmup steps and prints a Markdown table of the top CUDA kernels by self-time.
This lets an AI agent identify hardware bottlenecks (FlashAttention vs MLP vs
other kernels) without needing trace visualization tools.

Usage: uv run train.py --profile
Exits after printing the table without consuming the TIME_BUDGET.

Closes karpathy#118
IgorTavcar added a commit to IgorTavcar/autoresearch that referenced this pull request Mar 17, 2026
…iler, UCB1 search

PR karpathy#287 — Gradient clipping before optimizer step (baseline/train.py)
  Adds GRAD_CLIP_NORM hyperparameter (default 1.0, set 0.0 to disable).
  relu² activations can produce gradient spikes that silently degrade
  weights. The existing loss > 100 fast-fail only catches damage after
  it has already happened. Clipping prevents wasted experiment runs.

PR karpathy#279 — --profile flag for LLM-readable CUDA kernel summary (baseline/train.py)
  Adds argparse --profile flag that runs torch.profiler over a few warmup
  steps and prints a Markdown table of top CUDA kernels by self-time,
  then exits. Lets the agent identify hardware bottlenecks (attention vs
  MLP vs elementwise) without needing trace visualization tools.
  Usage: uv run baseline/train.py --profile

Issue karpathy#284 — DUSE alt program (baseline/program-alt.md)
  Alternative program.md integrating Dimensional UCB1 Search + Experiment
  Memory from issue karpathy#284. Adds: 7-dimension map, experiments.json structured
  memory, UCB1 dimension selector (exploration vs exploitation), 90-second
  early abort gate, rescue pool for recombining discarded sub-mechanisms.
  Pure prompt change, no code modifications required.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>