Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs

GPU architectures have continued to grow in complexity, with recent incarnations introducing increasingly powerful fixed-function units for matrix multiplication and data movement to accompany highly parallel general-purpose cores. To fully leverage these machines, software must use sophisticated schedules that maximally utilize all hardware resources. Since realizing such schedules is complex, both programmers and compilers routinely employ program transformations, such as software pipelining (SWP) and warp specialization (WS), to do so in practice. However, determining how best to use SWP and WS in combination is a challenging problem that is currently handled through a mix of brittle compilation heuristics and fallible human intuition, with little insight into the space of solutions. To remedy this situation, we introduce a novel formulation of SWP and WS as a joint optimization problem that can be solved holistically by off-the-shelf constraint solvers. We reify our approach in Twill, the first system that automatically derives optimal SWP and WS schedules for a large class of iterative programs. Twill is heuristic-free, easily extensible to new GPU architectures, and guaranteed to produce optimal schedules. We show that Twill can rediscover, and thereby prove optimal, the SWP and WS schedules manually developed by experts for Flash Attention on both the NVIDIA Hopper and Blackwell GPU architectures.

