Papers
arxiv:2511.15915

AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

Published on Apr 15 · Submitted by Genghan Zhang on Apr 20
Abstract

AccelOpt is a self-improving LLM agentic system that autonomously optimizes kernels for AI accelerators using iterative generation and optimization memory, achieving significant throughput improvements at reduced costs.

AI-generated summary

We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26x cheaper. The code is open-sourced at https://github.com/zhang677/AccelOpt.

Community

From the paper author and submitter:
  • AccelOpt boosts the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels.
  • AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26x cheaper.
  • AccelOpt is agnostic to kernel language. On 24 Triton kernels from FlashInfer-Bench (H100), AccelOpt with gpt-oss-120b achieved a 1.27x average speedup over the best Triton baselines, with a 3.19x peak speedup on a GQA decoding kernel. This adaptation took the first author 3 days.
  • In Stanford CS149 Fall 2025, a graduate-level parallel computing course, AccelOpt optimized a Conv2D kernel outside of NKIBench, reaching 48.8% of peak throughput starting from last year's reference implementation (9.54%). Based on the optimization proposed by AccelOpt, we designed an extra-credit problem; 33.6% of 131 student teams solved it.
  • The AccelOpt paper was accepted to MLSys 2026.




Get this paper in your agent:

hf papers read 2511.15915
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 3

Datasets citing this paper 1


Collections including this paper 1