Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high-level code for specific GPU targets. Recent work shows that LLM-based multi-agent systems can effectively perform such tuning, often outperforming existing compilers and eliminating the need for manual kernel development. However, the dynamics of multi-agent systems for this task remain unexplored. In this work, we present a logical framework for comparing multi-agent PyTorch optimization systems. Our evaluation shows that exploit-heavy strategies perform best when paired with error-fixing agents, and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88x speedup on an H100 GPU across diverse tasks in KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch.
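The reported speedups are relative to the eager-mode PyTorch reference for each task, measured on the same GPU after a correctness check. The snippet below is a minimal sketch of how such a per-task speedup can be measured; `BaselineModel`, the `benchmark` helper, and the use of `torch.compile` as a stand-in for the agent-produced candidate are illustrative assumptions, not the paper's actual harness or the KernelBench API.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a KernelBench-style task: a reference PyTorch
# model whose optimized variant is normally produced by the agent system.
class BaselineModel(nn.Module):
    def __init__(self, dim=4096):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))

def benchmark(model, x, iters=100, warmup=10):
    """Return mean latency in milliseconds using CUDA events."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

if __name__ == "__main__":
    device = "cuda"
    x = torch.randn(64, 4096, device=device)
    baseline = BaselineModel().to(device)
    # torch.compile stands in for the agent-optimized variant purely for
    # illustration; it shares the baseline's weights, so outputs should match.
    candidate = torch.compile(baseline)

    # Correctness gate: a candidate only counts if it matches the reference.
    with torch.no_grad():
        assert torch.allclose(baseline(x), candidate(x), atol=1e-3, rtol=1e-3)

    t_base = benchmark(baseline, x)
    t_cand = benchmark(candidate, x)
    print(f"speedup: {t_base / t_cand:.2f}x")
```

Averaging this per-task speedup across the benchmark's tasks yields the kind of aggregate figure (e.g., 2.88x) quoted above.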