We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided, hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. To evaluate AccelOpt, we build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels of varying complexity extracted from real-world LLM workloads. Our evaluation confirms that AccelOpt's capability improves over time: on NKIBench kernels, it raises the average fraction of peak throughput achieved from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper.
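To make the iterative-generation-with-memory idea concrete, below is a minimal sketch of the loop the abstract describes. This is an illustration under stated assumptions, not the authors' implementation: the names `OptimizationMemory`, `Insight`, `propose_kernel`, `summarize_diff`, and `benchmark` are all hypothetical placeholders for the LLM agent and the kernel profiler.

```python
# Minimal sketch of AccelOpt's loop as described in the abstract.
# All identifiers here (OptimizationMemory, propose_kernel, summarize_diff,
# benchmark) are hypothetical illustrations, not the paper's actual API.
from dataclasses import dataclass, field


@dataclass
class Insight:
    """A curated lesson distilled from one slow->fast kernel pair."""
    slow_kernel: str
    fast_kernel: str
    speedup: float
    lesson: str  # natural-language summary, e.g. "tile the reduction loop"


@dataclass
class OptimizationMemory:
    """Curates experiences from previously encountered slow-fast pairs."""
    insights: list[Insight] = field(default_factory=list)

    def add(self, insight: Insight) -> None:
        self.insights.append(insight)

    def top_k(self, k: int = 5) -> list[Insight]:
        # Surface the highest-speedup lessons to steer the next generation.
        return sorted(self.insights, key=lambda i: i.speedup, reverse=True)[:k]


def optimize(kernel: str, llm, benchmark, memory: OptimizationMemory,
             rounds: int = 10) -> str:
    """Iteratively ask the LLM for faster variants, keeping the best one."""
    best, best_time = kernel, benchmark(kernel)
    for _ in range(rounds):
        hints = "\n".join(i.lesson for i in memory.top_k())
        candidate = llm.propose_kernel(best, hints)  # hypothetical LLM call
        t = benchmark(candidate)
        if t < best_time:  # improvement found: record the pair and its lesson
            memory.add(Insight(best, candidate, best_time / t,
                               lesson=llm.summarize_diff(best, candidate)))
            best, best_time = candidate, t
    return best
```

The key design point this sketch tries to capture is the self-improvement feedback path: each discovered slow-fast pair is distilled into a reusable lesson that conditions future generations, so later optimization runs benefit from earlier ones.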