Welcome to the code repository of Autocomp. Recent updates:
(1/22/2026) Reorganized repo structure to make it easier to add a new hardware target.
(1/8/2026) Check out our latest 📝 blog post on optimizing attention on Trainium!
📚 Paper: Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators
✏️ Authors: Charles Hong, Sahil Bhatia, Alvin Cheung, and Yakun Sophia Shao (UC Berkeley)
Autocomp is an LLM-driven code optimizer for tensor accelerators. Autocomp is designed to be portable and easy to use across a variety of hardware targets, and has already demonstrated strong performance on an industry accelerator (AWS Trainium), an academic accelerator (Gemmini), NVIDIA GPUs, and even the RISC-V Vector Extension.
Autocomp decomposes the optimization problem into a beam search in which each iteration is divided into a planning phase and an implementation phase. Autocomp combines user-provided domain knowledge with a variety of search-space exploration techniques to iteratively improve the code. For more details, see our paper.
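At a high level, the search loop can be pictured with the minimal sketch below. This is an illustrative outline only, not Autocomp's actual code: function names such as `propose_plans`, `implement`, `evaluate`, and `is_correct` are placeholders for the planning, implementation, and evaluation stages.

```python
# Illustrative outline of the planning/implementation beam search (not Autocomp's real API).
def optimize(initial_code, iterations, beam_size, num_plans, num_impls):
    beam = [initial_code]
    for _ in range(iterations):
        candidates = []
        for code in beam:
            # Planning phase: propose several natural-language optimization plans.
            for plan in propose_plans(code, n=num_plans):
                # Implementation phase: turn each plan into candidate code versions.
                candidates.extend(implement(code, plan, n=num_impls))
        # Keep the best-performing, functionally correct candidates for the next iteration.
        scored = [(c, evaluate(c)) for c in candidates if is_correct(c)]
        beam = [c for c, _ in sorted(scored, key=lambda s: s[1])[:beam_size]] or beam
    return beam[0]
```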
Autocomp can currently optimize code for the following hardware targets:
- Trainium (trn_setup.md)
- Gemmini (gemmini_setup.md)
- CUDA via KernelBench (kb_setup.md)
- CUDA via GPU MODE (gpumode_setup.md)
Partially supported hardware targets:
- RISC-V Vector (RVV) on the Canaan Kendryte K230. See the k230 branch for code. As the implementation is very hacky, we do not currently recommend using this hardware target.
For instructions on adding a new hardware target, see ADDING_HARDWARE_SUPPORT.md.
Autocomp supports both local and remote endpoint LLM inference. For local inference, we support vLLM's OpenAI-compatible server. For endpoint inference, we support a variety of providers (see below).
- Install and launch vLLM:

  ```bash
  pip install vllm
  vllm serve --model Qwen/Qwen3-8B --port 8000 -tp <number of GPUs>
  ```

- Configure Autocomp: set `models`/`code_models` in `search.py`:

  ```python
  models = ["vllm::Qwen/Qwen3-8B"]
  ```

  Optionally set `VLLM_API_BASE` if using a different host/port (default: `http://localhost:8000/v1`).

- Multiple models on different ports: you can serve multiple vLLM models on separate ports and use them together by encoding the base URL in the provider string with the format `vllm@<base_url>::<model_name>`:

  ```bash
  # Terminal 1
  vllm serve --model Qwen/Qwen3-8B --port 8000 -tp 1
  # Terminal 2
  vllm serve --model meta-llama/Llama-3-70B --port 8001 -tp 4
  ```

  ```python
  models = [
      "vllm@http://localhost:8000/v1::Qwen/Qwen3-8B",
      "vllm@http://localhost:8001/v1::meta-llama/Llama-3-70B",
  ]
  ```
For more details, see the vLLM documentation.
API keys can be configured via environment variables or in `autocomp/common/keys.py`. Environment variables take precedence over the keys file. The variable names in `keys.py` match the corresponding environment variable names.
Supported keys:
| Provider | Environment Variable / Key Name | Provider Name in `search.py` |
|---|---|---|
| OpenAI | `OPENAI_API_KEY` | `openai` |
| Anthropic | `ANTHROPIC_API_KEY` | `anthropic` |
| Together | `TOGETHER_API_KEY` | `together` |
| AWS Bedrock | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION` | `aws` |
| Google Cloud | `GOOGLE_CLOUD_LOCATION`, `GOOGLE_CLOUD_PROJECT` | `gcp` |
Example `autocomp/common/keys.py`:

```python
OPENAI_API_KEY = "sk-..."
ANTHROPIC_API_KEY = "sk-ant-..."
TOGETHER_API_KEY = "..."
AWS_ACCESS_KEY_ID = "AKIA..."
AWS_SECRET_ACCESS_KEY = "..."
GOOGLE_CLOUD_LOCATION = "us-central1"
GOOGLE_CLOUD_PROJECT = "my-project"
```

Keys can be omitted if not needed. On startup, Autocomp logs which keys are available.
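As a minimal sketch of this precedence rule (the `resolve_key` helper below is hypothetical and not part of Autocomp's codebase):

```python
import os

# Hypothetical illustration of key lookup: environment variables take precedence
# over constants defined in autocomp/common/keys.py.
def resolve_key(name, keys_module):
    env_value = os.environ.get(name)
    if env_value:
        return env_value
    return getattr(keys_module, name, None)  # None if the key was omitted
```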
To use Gemini via Google Cloud, install the Google Cloud CLI as described at https://docs.cloud.google.com/sdk/docs/install-sdk#linux.
Run `gcloud auth application-default login` to authenticate the Google Cloud SDK.
Anthropic (Claude) models on Bedrock use the native Anthropic SDK adapter. All other Bedrock models (e.g., GLM, DeepSeek, Kimi) are supported via the Bedrock Converse API. Any model available in your Bedrock region can be used by passing its Bedrock model ID:
```python
models = [
    "aws::us.anthropic.claude-opus-4-5-20251101-v1:0",  # Claude (Anthropic adapter)
    "aws::zai.glm-4.7",          # GLM 4.7
    "aws::deepseek.v3.2",        # DeepSeek-V3.2
    "aws::moonshotai.kimi-k2.5", # Kimi K2.5
]
```

By default, the `us-west-2` region is used. Set the `AWS_REGION` environment variable (or add it to `keys.py`) to override.
`autocomp/search/search.py` is the entry point for running Autocomp optimization. Parameters such as the hardware target, models used, beam size, number of plans, number of code implementations, and dropout can be configured here.
Notable parameters:
- `backend_name`: The hardware target to use. Currently supported values are `gemmini`, `trn`, `kernelbench`, and `gpumode`.
- `agent_name`: The LLM agent type to use. Defaults based on `backend_name`. Currently supported agents are `gemmini`, `trn`, and `cuda` (used for both `kernelbench` and `gpumode`).
- `hw_config`: A hardware configuration object describing the target hardware. Examples:
  - `TrnHardwareConfig("trn1.2xlarge")`
  - `GemminiHardwareConfig(pe_dim=16, spad_size_kb=256, acc_size_kb=64)`
  - `CudaHardwareConfig("NVIDIA L40S", "2.5.0", "12.4")`
- `models`: The list of models to use. Models are specified as `"<provider>::<model>"`, for example `"openai::gpt-5.2"` or `"gcp::gemini-3-pro-preview"`. Currently supported endpoint providers are OpenAI (`openai`), Google Vertex AI (`gcp`), Anthropic (`anthropic`), AWS Bedrock (`aws`), and Together (`together`). Use provider `vllm` for local serving.
- `code_models`: The list of models to use for the implementation phase of prompting, if you would like to use a distinct set of models from planning. Can be set to `None` to use the same set of models.
- `simulator`: The evaluation method to use, if multiple are supported.
  - For Trainium, doesn't matter (put `None`)
  - For Gemmini, `spike` (only optimizes instruction counts, not cycle counts) or `firesim`
  - For CUDA/KernelBench, doesn't matter (put `None`)
  - For CUDA/GPU MODE, `gpumode-local` or `gpumode-cli`
- `iterations`: The number of iterations to run.
- `search_strategy`: The search strategy to use. Currently only `beam` is supported.
- `prob_type`: The problem type to use.
  - For Trainium, `trn-tutorial` or `trn-advanced`.
  - For Gemmini, `gemm`, `conv`, or `admm-multifunction`.
  - For CUDA/KernelBench, `kb-level1`, `kb-level2`, `kb-level3`, or `kb-level4`.
  - For CUDA/GPU MODE, `gpumode`.
- `prob_id`: The problem ID to use.
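For illustration, a run might be configured roughly as follows. This is a sketch only: the exact variable layout inside `search.py` may differ, and the specific values (model choices, iteration count, problem ID, hardware parameters) are example selections drawn from the options listed above.

```python
# Example parameter choices for a Gemmini run (illustrative values only).
backend_name = "gemmini"
agent_name = "gemmini"
hw_config = GemminiHardwareConfig(pe_dim=16, spad_size_kb=256, acc_size_kb=64)

models = ["openai::gpt-5.2", "gcp::gemini-3-pro-preview"]
code_models = None            # reuse the planning models for implementation

simulator = "spike"           # instruction-count evaluation on Gemmini
iterations = 10
search_strategy = "beam"      # currently the only supported strategy
prob_type = "gemm"
prob_id = 1
```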
- `autocomp/` - Core Autocomp code.
  - `search/` - Search algorithm (`search.py`) and optimization infrastructure.
  - `agents/` - LLM agents for planning and code generation. Each hardware target has its own subdirectory (e.g., `gemmini/`, `trn/`, `cuda/`) with agent code and prompts.
  - `backend/` - Eval backends for code evaluation. Each eval backend has its own subdirectory (e.g., `gemmini/`, `trn/`, `kernelbench/`, `gpumode/`) with evaluation code and setup instructions. One hardware target can have multiple eval backends.
  - `hw_config/` - Hardware configuration classes. Each hardware target has a config file (e.g., `cuda_config.py`, `gemmini_config.py`, `trn_config.py`).
  - `common/` - Shared utilities (LLM interface, logging, etc.).
    - `llm_utils.py` - LLM interface. Modify this file if you want to add a new LLM provider.
- `sols/` - Baseline code for benchmarks (organized by problem type).
- `tests/` - Test cases corresponding to `sols/`.
- `examples/` - Example optimization traces from Autocomp.
```bibtex
@misc{hong2025autocomp,
      title={Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators},
      author={Charles Hong and Sahil Bhatia and Alvin Cheung and Yakun Sophia Shao},
      year={2025},
      eprint={2505.18574},
      archivePrefix={arXiv},
      primaryClass={cs.PL},
      url={https://arxiv.org/abs/2505.18574},
}
```
(11/18/2025) Added documentation for adding a new hardware target (ADDING_HARDWARE_SUPPORT.md), added the examples directory for example optimization traces, and published 📝 blog post 4 about how we optimized conv1d on Trainium.
(11/3/2025) Added code/documentation for setting up Trainium. Check out 📝 blog post 3 for more details.
(9/22/2025) Added code/documentation for setting up CUDA/KernelBench, plus code for RVV optimization. Check out 📝 blog post 2 for more details.
(6/6/2025) Initial code + 📝 blog post 1 release!