Optimize any AI kernel, anywhere.
| Paper | Blog | VS Code Extension |
Autocomp is an LLM-powered kernel optimizer that writes faster code for your accelerator, so you don't have to. Point it at a kernel, pick your hardware target, and Autocomp speeds it up automatically.
It already delivers strong results across AWS Trainium, Google TPU, NVIDIA GPUs, Gemmini, and the RISC-V Vector Extension. Need a new target? The Agent Builder can spin up a hardware-specific optimization agent from your docs in minutes.
Read the paper · Authors: Charles Hong, Sahil Bhatia, Alvin Cheung, Yakun Sophia Shao (UC Berkeley)
Autocomp's workflow is:
- Pick your hardware target.
- Choose an optimization agent (or build your own).
- Set up an evaluation backend.
- Configure one or more LLMs.
- Edit `autocomp/search/run_search.py` with your settings.
- Run search.
For example, a Trainium run might look like this:
```python
# autocomp/search/run_search.py
backend_name = "trn"
agent_name = "built:trn1-nki1"
hw_config = TrnHardwareConfig("trn1.2xlarge")
prob_type = "trn-tutorial-nki1"
prob_id = 2
models = ["openai::gpt-5.2"]
```

Then run:

```shell
python -m autocomp.search.run_search
```

Keep reading for more on picking your hardware target, setting up your backend, configuring LLM providers, and tuning the search.
Each hardware target requires two things: an optimization agent that knows how to optimize code for that target, and an evaluation backend, the toolchain that compiles and benchmarks code on it. You also provide a hardware config (`hw_config`) that describes your specific hardware instance (e.g., `TrnHardwareConfig("trn1.2xlarge")`). The table below shows the supported targets and the agents/backends available for each.
| Hardware target | Optimization agent(s) | Evaluation backend(s) |
|---|---|---|
| AWS Trainium | `built:trn1-nki1` (Trainium 1, NKI v1), `built:trn2-nki1` (Trainium 2, NKI v1), `built:trn2-nki2` (Trainium 2, NKI v2) | `trn` (trn_setup.md) |
| Google TPU | `built:tpu-v6e` (TPU v6e) | `tpu` (tpu_setup.md) |
| Gemmini | `gemmini` | `gemmini` (gemmini_setup.md) |
| NVIDIA GPU | `cuda` | `kernelbench` (kb_setup.md), `gpumode` (gpumode_setup.md) |
Partially supported hardware targets:
- RISC-V Vector (RVV) on Canaan Kendryte K230. See the `k230` branch for code. As the implementation is very hacky, we do not currently recommend using this hardware target.
For instructions on adding full codebase support for a new hardware target (eval backend, config class, etc.), see ADDING_HARDWARE_SUPPORT.md.
Optimization agents decide what transformations to try and how to implement them. In `run_search.py`, this is controlled by `agent_name`. Each agent is designed for a specific hardware target; see the table above for the right agent for each target. We recommend using the Agent Builder as the fastest way to set up a complete agent from your hardware's documentation.
Want to create a new agent? The Agent Builder automatically generates hardware-specific optimization agents from documentation sources such as local directories, PDFs, and webpages. Built agents are stored in autocomp/agent_builder/.built/ and selected with agent_name = "built:<name>". Legacy handcrafted agents in autocomp/agents/ (e.g., gemmini, cuda) are also available for some targets.
```shell
pip install "autocomp[agent-builder]"
python -m autocomp.agent_builder.run_agent_builder \
    --agent-name my_accelerator \
    --source-dir path/to/docs \
    --agent-scope "Optimizing kernels for MyAccelerator using the XYZ programming interface."
```

For detailed usage, CLI options, Python API, and output format, see the Agent Builder documentation.
Autocomp supports both local and remote endpoint LLM inference. For local inference, we support vLLM's OpenAI-compatible server. For endpoint inference, we support a variety of providers (see below).
1. Install and launch vLLM:

   ```shell
   pip install vllm
   vllm serve --model Qwen/Qwen3-8B --port 8000 -tp <number of GPUs>
   ```

2. Configure Autocomp: set `models`/`code_models` in `run_search.py`:

   ```python
   models = ["vllm::Qwen/Qwen3-8B"]
   ```

   Optionally set `VLLM_API_BASE` if using a different host/port (default: `http://localhost:8000/v1`).

3. Multiple models on different ports: you can serve multiple vLLM models on separate ports and use them together by encoding the base URL in the provider string with the format `vllm@<base_url>::<model_name>`:

   ```shell
   # Terminal 1
   vllm serve --model Qwen/Qwen3-8B --port 8000 -tp 1
   # Terminal 2
   vllm serve --model meta-llama/Llama-3-70B --port 8001 -tp 4
   ```

   ```python
   models = [
       "vllm@http://localhost:8000/v1::Qwen/Qwen3-8B",
       "vllm@http://localhost:8001/v1::meta-llama/Llama-3-70B",
   ]
   ```
For more details, see the vLLM documentation.
API keys can be configured via environment variables or in autocomp/common/keys.py. Environment variables take precedence over the keys file. The variable names in keys.py match the corresponding environment variable names.
Supported keys:
| Provider | Environment Variable / Key Name | Provider Name in run_search.py |
|---|---|---|
| OpenAI | `OPENAI_API_KEY` | `openai` |
| Anthropic | `ANTHROPIC_API_KEY` | `anthropic` |
| Together | `TOGETHER_API_KEY` | `together` |
| AWS Bedrock | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION` | `aws` |
| Google Cloud (Vertex AI) | `GOOGLE_CLOUD_LOCATION`, `GOOGLE_CLOUD_PROJECT` | `gcp` |
| Google AI Studio | `GOOGLE_API_KEY` | `gcp` |
Example autocomp/common/keys.py:
```python
OPENAI_API_KEY = "sk-..."
ANTHROPIC_API_KEY = "sk-ant-..."
TOGETHER_API_KEY = "..."
AWS_ACCESS_KEY_ID = "AKIA..."
AWS_SECRET_ACCESS_KEY = "..."
GOOGLE_CLOUD_LOCATION = "us-central1"
GOOGLE_CLOUD_PROJECT = "my-project"
GOOGLE_API_KEY = "AIza..."
```

Keys can be omitted if not needed. On startup, Autocomp logs which keys are available.
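The precedence rule above (environment variables win over `keys.py`) can be sketched as follows. This is a hypothetical helper for illustration, not Autocomp's actual implementation:

```python
import os

# Hypothetical helper illustrating key precedence: an environment variable,
# if set, overrides the value defined in autocomp/common/keys.py.
def resolve_key(name, keys_file_values):
    return os.environ.get(name) or keys_file_values.get(name)

# keys.py defines OPENAI_API_KEY, but the environment overrides it:
os.environ["OPENAI_API_KEY"] = "sk-from-env"
print(resolve_key("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-from-file"}))  # sk-from-env
```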
Option 1: Google Cloud (Vertex AI). Install the Google Cloud CLI as described at https://docs.cloud.google.com/sdk/docs/install-sdk#linux. Run gcloud auth application-default login and set GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_LOCATION.
Option 2: Google AI Studio. Get an API key from Google AI Studio and set GOOGLE_API_KEY.
If both Vertex AI credentials and GOOGLE_API_KEY are set, Vertex AI is used.
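The selection rule can be expressed as a small sketch (assumed semantics based on the description above, not Autocomp's actual code):

```python
# Hypothetical sketch of Google provider selection: a full Vertex AI
# configuration (project + location) wins over a Google AI Studio key.
def pick_google_backend(env):
    if env.get("GOOGLE_CLOUD_PROJECT") and env.get("GOOGLE_CLOUD_LOCATION"):
        return "vertex-ai"
    if env.get("GOOGLE_API_KEY"):
        return "ai-studio"
    return None

# Both configured: Vertex AI is used.
print(pick_google_backend({
    "GOOGLE_CLOUD_PROJECT": "my-project",
    "GOOGLE_CLOUD_LOCATION": "us-central1",
    "GOOGLE_API_KEY": "AIza...",
}))  # vertex-ai
```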
Anthropic (Claude) models on Bedrock use the native Anthropic SDK adapter. All other Bedrock models (e.g., GLM, DeepSeek, Kimi) are supported via the Bedrock Converse API. Any model available in your Bedrock region can be used by passing its Bedrock model ID:
```python
models = [
    "aws::us.anthropic.claude-opus-4-5-20251101-v1:0",  # Claude (Anthropic adapter)
    "aws::zai.glm-5",                                   # GLM 5
]
```

By default the us-west-2 region is used. Set the `AWS_REGION` environment variable (or add it to `keys.py`) to override.
autocomp/search/run_search.py is the entry point for running Autocomp optimization.
```shell
python -m autocomp.search.run_search
```

The most important parameters are:
Hardware Target
- `hw_config`: A hardware configuration object describing the target hardware. Examples:
  - `TrnHardwareConfig("trn1.2xlarge")`
  - `TpuHardwareConfig("v6e-1")`
  - `GemminiHardwareConfig(pe_dim=16, spad_size_kb=256, acc_size_kb=64)`
  - `CudaHardwareConfig("NVIDIA L40S", "2.5.0", "12.4")`
Evaluation Backend
- `backend_name`: The evaluation backend to use. Currently supported values are `trn`, `tpu`, `gemmini`, `kernelbench`, and `gpumode`.
- `simulator`: The evaluation method to use, if the backend supports multiple. For all others, put `None`.
  - For Gemmini, `spike` (only optimizes instruction counts, not cycle counts) or `firesim`.
  - For CUDA/GPU MODE, `gpumode-local` or `gpumode-cli`.
Benchmark
- `prob_type`: The problem type to use.
  - For Trainium, `trn-tutorial-nki1`, `trn-tutorial-nki2`, `trn-advanced-nki1`, or `trn-advanced-nki2`.
  - For TPU, `tpu`, `jaxbench-pallas`, `jaxbench-real`, `jaxbench-priority`, `jaxbench-tokamax`, or `jaxkernelbench`.
  - For Gemmini, `gemm`, `conv`, or `admm-multifunction`.
  - For CUDA/KernelBench, `kb-level1`, `kb-level2`, `kb-level3`, or `kb-level4`.
  - For CUDA/GPU MODE, `gpumode`.
- `prob_id`: The problem ID to use.
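Putting the backend and benchmark knobs together, a Gemmini configuration in `run_search.py` might look like this (values are illustrative examples, not recommendations):

```python
# Illustrative run_search.py settings for a Gemmini run (example values):
backend_name = "gemmini"
simulator = "spike"   # instruction counts only; use "firesim" for cycle counts
prob_type = "gemm"
prob_id = 0           # example problem ID
agent_name = "gemmini"
```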
Optimization Agent
- `agent_name`: The optimization agent to use. See the table above for the right agent for each target.
Models
- `models`: The list of models to use. Models are specified as `"<provider>::<model>"`, for example `"openai::gpt-5.2"` or `"gcp::gemini-3-pro-preview"`. Currently supported endpoint providers are OpenAI (`openai`), Google Vertex AI (`gcp`), Anthropic (`anthropic`), AWS Bedrock (`aws`), and Together (`together`). Use provider `vllm` for local serving.
- `code_models`: The list of models to use for the implementation phase, if you would like to use a distinct set of models from planning. Can be set to `None` to use the same set of models.
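The `"<provider>::<model>"` format (including the `vllm@<base_url>` variant described earlier) decomposes as in this sketch. The parser is hypothetical, written only to illustrate the string format:

```python
# Hypothetical parser illustrating the "<provider>::<model>" model spec
# format, including the vllm@<base_url> variant (not Autocomp's actual code).
def parse_model_spec(spec):
    provider, _, model = spec.partition("::")
    base_url = None
    if "@" in provider:
        provider, _, base_url = provider.partition("@")
    return provider, base_url, model

print(parse_model_spec("openai::gpt-5.2"))
# ('openai', None, 'gpt-5.2')
print(parse_model_spec("vllm@http://localhost:8000/v1::Qwen/Qwen3-8B"))
# ('vllm', 'http://localhost:8000/v1', 'Qwen/Qwen3-8B')
```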
Search
- `iterations`: The number of iterations to run.
- `search_strategy`: The search strategy to use. Currently only `beam` is supported.
- `num_plan_candidates`: Number of plans (strategies) generated per parent candidate per iteration. Default `4`.
- `num_code_candidates`: Number of code implementations generated per plan. Default `2`.
- `beam_size`: Number of candidates kept in the beam after each iteration. Default `4`.
- `dropout_menu_options`: Probability of dropping each strategy menu option from the prompt, encouraging diversity. Default `0.25`.
- `early_stop_iters`: Stop after N iterations without improvement (`0` = disabled).
- `resume_from`: Path to a previous run's output directory. Loads the final candidates from that run as the starting beam (e.g., to optimize after a translation-only run).
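As a rough sketch of how these knobs interact (assuming each beam member spawns its own plans, which each get their own implementations, per the definitions above), the defaults imply up to 4 × 4 × 2 = 32 candidate implementations evaluated per iteration:

```python
# Example search settings (illustrative values, not recommendations):
iterations = 10
search_strategy = "beam"
num_plan_candidates = 4    # plans per parent candidate per iteration
num_code_candidates = 2    # implementations per plan
beam_size = 4              # candidates kept after each iteration
dropout_menu_options = 0.25
early_stop_iters = 0       # 0 disables early stopping

# Upper bound on candidates generated per iteration, by the definitions above:
print(beam_size * num_plan_candidates * num_code_candidates)  # 32
```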
Code Generation
- `use_edits`: If `True`, the LLM outputs structured JSON edits (`old_str`/`new_str` pairs) instead of rewriting the entire file. Generally more effective when code size is large. Defaults to `False`.
- `reimplement_failed`: Re-generate code for candidates that failed evaluation (only works on supported agents).
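To make the `old_str`/`new_str` idea concrete, here is a minimal sketch of applying such edits to a source string. The edit format shown is a simplified stand-in, not Autocomp's exact JSON schema:

```python
# Minimal sketch: each edit replaces the first occurrence of old_str with
# new_str, so the model only emits the changed regions rather than the
# whole file (simplified illustration, not Autocomp's exact schema).
def apply_edits(source, edits):
    for e in edits:
        source = source.replace(e["old_str"], e["new_str"], 1)
    return source

kernel = "for i in range(N):\n    acc += a[i] * b[i]\n"
edits = [{"old_str": "range(N)", "new_str": "range(0, N, 1)"}]
print(apply_edits(kernel, edits))
```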
Translation
- `translate_iters`: Number of initial iterations that use translation strategies (converting code to the target representation) instead of optimization strategies. Defaults to `0` (no translation). Only works on supported agents. Built agents load strategies from `translate_menu.yaml`; see the Agent Builder docs.
- `translate_perf_threshold`: During translation iterations, candidates are kept if their score is within this factor of the best score (e.g., `1.2` means up to 20% worse).
- `translate_score`: If `True`, score translation candidates by code similarity to the original (how complete the translation is), not just latency. Defaults to `True`.
- `translate_drop_original`: If `True`, drop the original (untranslated) candidate from the beam after the last translation iteration. Defaults to `True`.
Built Agent Options
- `menu_strategy`: Set to `"one-shot"` to dynamically generate new strategies per candidate via an LLM call, or `None` for the static menu only.
- `fine_grained_isa`: Enables two-level ISA filtering (section then subsection) to include only relevant ISA documentation in the prompt.
- `example_rate`: Per-example probability of including an LLM-selected code example in the planning prompt.
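The per-item probabilities (`example_rate` here, and `dropout_menu_options` under Search) can be pictured as independent coin flips per prompt item. This is an assumed reading of the descriptions above, not Autocomp's implementation:

```python
import random

# Assumed semantics (illustrative sketch): each menu option is dropped with
# probability `dropout`, and each code example is included with probability
# `example_rate`, independently per item.
def sample_prompt_items(menu, examples, dropout=0.25, example_rate=0.5, rng=random):
    kept_menu = [m for m in menu if rng.random() >= dropout]
    kept_examples = [e for e in examples if rng.random() < example_rate]
    return kept_menu, kept_examples

menu = ["tile loops", "vectorize inner loop", "fuse kernels"]
examples = ["matmul example", "softmax example"]
print(sample_prompt_items(menu, examples, dropout=0.0, example_rate=1.0))
```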
The Autocomp Trace Visualizer is a VS Code extension for exploring optimization runs interactively. After a run completes, use it to understand what strategies worked, how scores improved, and where the search spent its time. See the Trace Visualizer documentation for install instructions and features.
autocomp/ - Core Autocomp code.
- `search/` - Search algorithm (`search.py`) and optimization infrastructure. `run_search.py` is the entry point.
- `agents/` - LLM agents for planning and code generation. Each hardware target has its own subdirectory (e.g., `gemmini/`, `trn/`, `cuda/`) with agent code and prompts.
- `agent_builder/` - Agent Builder pipeline for creating new hardware-specific agents from documentation sources. See the Agent Builder documentation for details.
- `backend/` - Eval backends for code evaluation. Each eval backend has its own subdirectory (e.g., `gemmini/`, `trn/`, `tpu/`, `kernelbench/`, `gpumode/`) with evaluation code and setup instructions. One hardware target can have multiple eval backends.
- `hw_config/` - Hardware configuration classes. Each hardware target has a config file (e.g., `cuda_config.py`, `gemmini_config.py`, `trn_config.py`, `tpu_config.py`).
- `common/` - Shared utilities (LLM interface, logging, etc.). `llm_utils.py` is the LLM interface; modify this file if you want to add a new LLM provider.
sols/ - Baseline code for benchmarks (organized by problem type).
tests/ - Test cases corresponding to sols/.
examples/ - Example optimization traces from Autocomp.
```bibtex
@misc{hong2025autocomp,
  title={Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators},
  author={Charles Hong and Sahil Bhatia and Alvin Cheung and Yakun Sophia Shao},
  year={2025},
  eprint={2505.18574},
  archivePrefix={arXiv},
  primaryClass={cs.PL},
  url={https://arxiv.org/abs/2505.18574},
}
```
Install dev dependencies:
```shell
pip install -e ".[dev]"
```

Run tests:

```shell
WANDB_MODE=disabled pytest
```

See CONTRIBUTING.md for more details on how to add tests and the CI workflow.
(3/25/2026) Added support for structured-output code edits in the code implementation phase.
(3/17/2026) Added preliminary TPU support and enhanced Autocomp's code translation capabilities.
(3/13/2026) Added the Agent Builder for automatically creating hardware-specific LLM agents from documentation sources.
(1/22/2026) Reorganized repo structure to make it easier to add a new hardware target.
(1/8/2026) Check out our latest blog post on optimizing attention on Trainium!
(11/18/2025) Added documentation for adding a new hardware target (ADDING_HARDWARE_SUPPORT.md), added the examples directory for example optimization traces, and published blog post 4 about how we optimized conv1d on Trainium.
(11/3/2025) Added code/documentation for setting up Trainium. Check out blog post 3 for more details.
(9/22/2025) Added code/documentation for setting up CUDA/KernelBench, plus code for RVV optimization. Check out blog post 2 for more details.
(6/6/2025) Initial code + blog post 1 release!