feat: add --profile flag for LLM-readable CUDA kernel summary #279
Open
Bortlesboat wants to merge 1 commit into karpathy:master from
Conversation
Add a `--profile` CLI flag to `train.py` that runs `torch.profiler` over a few warmup steps and prints a Markdown table of the top CUDA kernels by self-time. This lets an AI agent identify hardware bottlenecks (FlashAttention vs. MLP vs. other kernels) without needing trace visualization tools.

Usage: `uv run train.py --profile`

Exits after printing the table without consuming the TIME_BUDGET.

Closes karpathy#118
IgorTavcar added a commit to IgorTavcar/autoresearch that referenced this pull request on Mar 17, 2026
…iler, UCB1 search

PR karpathy#287 — Gradient clipping before optimizer step (`baseline/train.py`). Adds a `GRAD_CLIP_NORM` hyperparameter (default 1.0; set 0.0 to disable). relu² activations can produce gradient spikes that silently degrade weights, and the existing `loss > 100` fast-fail only catches damage after it has already happened. Clipping prevents wasted experiment runs.

PR karpathy#279 — `--profile` flag for LLM-readable CUDA kernel summary (`baseline/train.py`). Adds an argparse `--profile` flag that runs `torch.profiler` over a few warmup steps, prints a Markdown table of top CUDA kernels by self-time, then exits. Lets the agent identify hardware bottlenecks (attention vs. MLP vs. elementwise) without needing trace visualization tools. Usage: `uv run baseline/train.py --profile`

Issue karpathy#284 — DUSE alt program (`baseline/program-alt.md`). An alternative `program.md` integrating Dimensional UCB1 Search + Experiment Memory from issue karpathy#284. Adds: a 7-dimension map, `experiments.json` structured memory, a UCB1 dimension selector (exploration vs. exploitation), a 90-second early-abort gate, and a rescue pool for recombining discarded sub-mechanisms. Pure prompt change; no code modifications required.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
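The gradient-clipping change described for PR karpathy#287 can be sketched roughly as below. This is an illustration, not the PR's actual code: only the `GRAD_CLIP_NORM` name and default come from the commit message; the model and optimizer here are stand-ins.

```python
import torch

# Hyperparameter from the commit message: default 1.0, set 0.0 to disable.
GRAD_CLIP_NORM = 1.0

model = torch.nn.Linear(4, 4)  # stand-in for the real model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)
loss = model(x).pow(2).mean()
loss.backward()

# Clip the global gradient norm BEFORE the optimizer step, so a single
# gradient spike cannot silently corrupt the weights.
if GRAD_CLIP_NORM > 0.0:
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP_NORM)

opt.step()
opt.zero_grad(set_to_none=True)
```

Note that `clip_grad_norm_` returns the pre-clip norm, which can itself be logged as a cheap spike detector.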
Problem
Addresses #118. The AI agent has no visibility into which CUDA kernels are the actual hardware bottlenecks. It can only infer from step time and MFU%, which are too coarse to guide targeted optimization (e.g., "is attention or MLP the bottleneck?").
Solution
Add a `--profile` flag to `train.py` that:

- runs `torch.profiler` over a small number of warmup steps (2 warmup + 5 active)
- prints a Markdown table of the top CUDA kernels by self-time
- exits without consuming the `TIME_BUDGET`

Usage: `uv run train.py --profile`
Example output:
This lets the agent immediately see, for example, that FlashAttention dominates at 38% — and decide whether to explore window size changes vs. MLP width changes.
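The Markdown-table step itself is plain string formatting. A sketch over hypothetical kernel stats follows; the kernel names and timings below are made up for illustration and are not the PR's real output.

```python
def kernel_markdown_table(stats, top_k=3):
    """Render (kernel_name, self_time_us) pairs as a Markdown table with % of total."""
    total = sum(t for _, t in stats)
    rows = sorted(stats, key=lambda kv: kv[1], reverse=True)[:top_k]
    lines = [
        "| Kernel | Self time (us) | % of total |",
        "|---|---:|---:|",
    ]
    for name, t in rows:
        lines.append(f"| {name} | {t} | {100.0 * t / total:.1f}% |")
    return "\n".join(lines)

# Hypothetical numbers, purely illustrative.
stats = [("flash_attn_fwd", 380), ("mlp_gemm", 300), ("layernorm", 120), ("other", 200)]
print(kernel_markdown_table(stats))
```

A percent-of-total column is what makes the table LLM-readable at a glance: the agent can compare shares directly instead of reasoning about absolute microseconds.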
Changes
- `train.py` only (the single file agents are allowed to modify)
- `+sys`, `+argparse` imports (~2 lines)
- `argparse.parse_known_args()` block (~5 lines, `add_help=False` so it doesn't conflict with `uv run`)
- `run_profiler()` function (~35 lines)
- `if PROFILE_MODE: run_profiler(...); sys.exit(0)` (~4 lines)

No changes to `prepare.py`, `program.md`, or any other file. Normal training runs are completely unaffected.
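The argparse block described above might look roughly like this. It is a sketch, not the PR's exact code: `parse_known_args()` tolerates flags the script does not define, and `add_help=False` keeps the parser from hijacking `-h` when the script is launched through a wrapper CLI.

```python
import argparse

# add_help=False: don't register -h/--help, so the parser can't conflict
# with whatever wrapper (e.g. `uv run`) launched the script.
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument("--profile", action="store_true")

# parse_known_args returns (namespace, leftover_args) instead of erroring
# on unrecognized flags, which keeps the script robust to extra arguments.
args, unknown = parser.parse_known_args(["--profile", "--some-harness-flag"])
PROFILE_MODE = args.profile
```

In the real script, `PROFILE_MODE` would then gate `run_profiler(...)` followed by `sys.exit(0)`, so a profiling run never falls through into normal training.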