Papers
arxiv:2511.15915

AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

Published on Apr 15 · Submitted by Genghan Zhang on Apr 20
Abstract

AccelOpt is a self-improving LLM agentic system that autonomously optimizes kernels for AI accelerators using iterative generation and optimization memory, achieving significant throughput improvements at reduced costs.

AI-generated summary

We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26x cheaper. The code is open-sourced at https://github.com/zhang677/AccelOpt.

Community

From the paper author and submitter:
  • AccelOpt boosts the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels.
  • AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26x cheaper.
  • AccelOpt is agnostic to kernel language. On 24 Triton kernels from FlashInfer-Bench (H100), AccelOpt with gpt-oss-120b achieved a 1.27x average speedup over the best Triton baselines, with a 3.19x peak speedup on a GQA decoding kernel. This adaptation took the first author 3 days.
  • In Stanford CS149 Fall 2025, a graduate-level parallel computing course, AccelOpt optimized a Conv2D kernel outside of NKIBench, reaching 48.8% of peak throughput starting from last year's reference implementation (9.54%). Based on the optimization proposed by AccelOpt, we designed an extra-credit problem; 33.6% of 131 student teams solved it.
  • The AccelOpt paper was accepted to MLSys 2026.




Get this paper in your agent:

hf papers read 2511.15915
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 3

Datasets citing this paper 1


Collections including this paper 1