PEAK: A Performance Engineering AI-Assistant for GPU Kernels Powered by Natural Language Transformations

Advances in large language models (LLMs) are showing promising impact in software development and programming assistance. However, these models struggle with low-level backend code. The challenge is exacerbated for GPU kernels, where performance-critical details are coupled to rapidly evolving hardware characteristics and available code examples are sparse. In this work, we introduce PEAK, a Performance Engineering AI-Assistant for GPU Kernels powered by natural language transformations. PEAK builds on the key insight that iterative code transformations (optimizations) can be written straightforwardly in natural language and then carried out by LLMs. These transformations can therefore be developed rapidly, encoding general, portable optimizations, while also being easily specialized to specific GPU devices and even individual kernels. These natural transformations are supported by a modular and extensible infrastructure that additionally performs validation and performance evaluation. We demonstrate the flexibility of PEAK by instantiating it for three backends (CUDA, HIP, and HLSL) and creating 16 natural transformations for optimizing matrix multiplication kernels. Our resulting implementations are competitive with vendor libraries where they are available, and for HLSL, which has no such library, they match the hardware-documented FLOPS. PEAK enables fine-grained exploration of several research questions about how LLMs behave in this domain, including characterizing transformations and their errors and how performance evolves along optimization sequences. PEAK provides an interface that can either be used by performance engineers to improve productivity or driven completely autonomously (e.g., by an AI agent), a forward-compatible design that can continue to improve with advances in AI capabilities.
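The loop described in the abstract (apply a natural-language transformation, validate the result, evaluate its performance) can be illustrated with a minimal Python sketch. This is not PEAK's actual interface: the `Transformation` class, the `optimize` function, the greedy "keep only if faster" policy, and the three stub functions standing in for the LLM, the validator, and the benchmark harness are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Transformation:
    """Hypothetical container: an optimization written in natural language."""
    name: str
    prompt: str


def optimize(
    kernel_src: str,
    transformations: List[Transformation],
    apply_fn: Callable[[str, str], str],   # stand-in for an LLM call
    validate_fn: Callable[[str], bool],    # stand-in for a correctness check
    measure_fn: Callable[[str], float],    # stand-in for a benchmark (lower is better)
) -> Tuple[str, List[Tuple[str, str, float]]]:
    """Apply transformations in order, keeping a candidate only if it
    validates and is faster than the best kernel so far (assumed policy)."""
    best_src = kernel_src
    best_time = measure_fn(best_src)
    history = [("baseline", "accepted", best_time)]
    for t in transformations:
        candidate = apply_fn(best_src, t.prompt)
        if not validate_fn(candidate):
            history.append((t.name, "rejected: failed validation", best_time))
            continue
        elapsed = measure_fn(candidate)
        if elapsed < best_time:
            best_src, best_time = candidate, elapsed
            history.append((t.name, "accepted", elapsed))
        else:
            history.append((t.name, "rejected: no speedup", elapsed))
    return best_src, history


# Deterministic stubs so the sketch runs without a GPU or an LLM.
def fake_llm(src: str, prompt: str) -> str:
    # Simulate an LLM error on one transformation.
    if "vectorize" in prompt.lower():
        return src + "\n// BROKEN"
    return src + f"\n// applied: {prompt}"


def fake_validate(src: str) -> bool:
    return "BROKEN" not in src


def fake_measure(src: str) -> float:
    # Pretend each successfully applied transformation reduces runtime.
    return 100.0 / (1 + src.count("// applied:"))


TRANSFORMATIONS = [
    Transformation("tile", "Tile the loops so each thread block stages sub-matrices in shared memory."),
    Transformation("vectorize", "Replace scalar global loads with vectorized loads."),
    Transformation("unroll", "Unroll the innermost loop by a factor of 4."),
]

if __name__ == "__main__":
    _, history = optimize(
        "__global__ void matmul() {}", TRANSFORMATIONS,
        fake_llm, fake_validate, fake_measure,
    )
    for name, status, elapsed in history:
        print(f"{name:10s} {status:30s} {elapsed:.2f}")
```

Because every stage is passed in as a callable, the same loop could in principle be driven by a performance engineer choosing transformations by hand or by an autonomous agent. The recorded history is one simple way to observe how performance evolves along an optimization sequence and which transformations fail validation.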
