Modern AI models demand high-performance computation kernels. The growing complexity of LLMs, multimodal architectures, and recommendation systems, combined with techniques such as sparsity and quantization, creates significant computational challenges. Frequent hardware updates and diverse chip architectures further complicate this landscape, requiring tailored kernel implementations for each platform. However, manual optimization cannot keep pace with these demands, creating a critical bottleneck in AI system development. Recent advances in LLM code generation have opened new possibilities for automating kernel development. In this work, we propose AKG kernel agent (AI-driven Kernel Generator), a multi-agent system that automates kernel generation, migration, and performance tuning. AKG kernel agent supports multiple domain-specific languages (DSLs), including Triton, TileLang, CPP, and CUDA-C, enabling it to target different hardware backends while maintaining correctness and portability. The system's modular design allows rapid integration of new DSLs and hardware targets. Evaluated on KernelBench with the Triton DSL across GPU and NPU backends, AKG kernel agent achieves an average speedup of 1.46$\times$ over PyTorch Eager baseline implementations, demonstrating its effectiveness in accelerating kernel development for modern AI workloads.
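For context, the sketch below shows the kind of Triton kernel and launch wrapper that a system like AKG kernel agent would emit for a simple operator; the kernel name, block size, and wrapper are illustrative assumptions, not code from the paper.

```python
# Illustrative sketch only: a minimal Triton elementwise-add kernel of the kind
# AKG kernel agent targets. Names and the BLOCK_SIZE choice are assumptions.
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Host-side wrapper: allocate the output and launch a 1-D grid of programs.
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```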