Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high-level code for specific GPU targets. Recent work shows that LLM-based multi-agent systems can effectively perform such tuning, often outperforming existing compilers and eliminating the need for manual kernel development. However, the dynamics of multi-agent systems for this task remain unexplored. In this work, we present a logical framework for comparing multi-agent PyTorch optimization systems. Our evaluation shows that exploit-heavy strategies perform best when paired with error-fixing agents, and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88x speedup on an H100 GPU across diverse tasks in KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch.
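The reported speedups are relative to the eager-mode PyTorch reference for each task, measured on the same GPU after a correctness check. The snippet below is a minimal sketch of how such a per-task speedup can be measured; `BaselineModel`, the `benchmark` helper, and the use of `torch.compile` as a stand-in for the agent-produced candidate are illustrative assumptions, not the paper's actual harness or the KernelBench API.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a KernelBench-style task: a reference PyTorch
# model whose optimized variant is normally produced by the agent system.
class BaselineModel(nn.Module):
    def __init__(self, dim=4096):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))

def benchmark(model, x, iters=100, warmup=10):
    """Return mean latency in milliseconds using CUDA events."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

if __name__ == "__main__":
    device = "cuda"
    x = torch.randn(64, 4096, device=device)
    baseline = BaselineModel().to(device)
    # torch.compile stands in for the agent-optimized variant purely for
    # illustration; it shares the baseline's weights, so outputs should match.
    candidate = torch.compile(baseline)

    # Correctness gate: a candidate only counts if it matches the reference.
    with torch.no_grad():
        assert torch.allclose(baseline(x), candidate(x), atol=1e-3, rtol=1e-3)

    t_base = benchmark(baseline, x)
    t_cand = benchmark(candidate, x)
    print(f"speedup: {t_base / t_cand:.2f}x")
```

Averaging this per-task speedup across the benchmark's tasks yields the kind of aggregate figure (e.g., 2.88x) quoted above.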