Pipelining between data loading and computation is a critical tensor program optimization for GPUs. Multi-stage pipelining across the multi-level buffer hierarchy of GPUs is particularly indispensable on the latest NVIDIA Ampere GPUs to reduce resource idleness and guarantee kernel performance. Currently, developers access the pipelining optimization through expert-written libraries such as cuBLAS rather than through a tensor program transformation, which is not extensible to new operators and not composable with prior tensor compiler optimizations. We present ALCOP, an automatic pipelining framework built on the TVM infrastructure that overcomes three critical obstacles in generating pipelined code: detection of pipelining-applicable buffers, program transformation for multi-level multi-stage pipelining, and efficient schedule parameter search by incorporating static analysis. Experiments show that ALCOP can generate programs with a 1.23x speedup on average (up to 1.73x) over vanilla TVM. On end-to-end models, ALCOP improves upon TVM by up to 1.18x and upon XLA by up to 1.64x. Moreover, our performance model significantly improves the efficiency of the schedule tuning process and can find schedules with 99% of the performance given by exhaustive search while requiring 40x fewer trials.
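To make the core idea concrete: multi-stage pipelining keeps several tile loads in flight so that the load of a future tile overlaps with the computation on the current one. The sketch below is not ALCOP's actual transformation (which operates on TVM IR and targets GPU async copies); it is a minimal plain-Python illustration of an N-stage software pipeline over a tiled reduction, with hypothetical helper names (`load_tile`, `compute_tile`, `pipelined_reduce`).

```python
def load_tile(data, i, tile):
    # Stand-in for an asynchronous copy from global to shared memory.
    return data[i * tile:(i + 1) * tile]

def compute_tile(tile_data):
    # Stand-in for the per-tile compute stage (here, a sum of squares).
    return sum(x * x for x in tile_data)

def pipelined_reduce(data, tile=4, stages=2):
    """Tiled reduction with a `stages`-deep load/compute pipeline."""
    n_tiles = len(data) // tile
    # Prologue: fill the pipeline with `stages` in-flight tile loads.
    buffers = [load_tile(data, i, tile) for i in range(min(stages, n_tiles))]
    total = 0
    for i in range(n_tiles):
        # Consume the oldest buffered tile ...
        total += compute_tile(buffers[i % stages])
        # ... and immediately refill its slot with a future tile, so the
        # "load" of tile i+stages overlaps the "compute" of tiles i+1, ...
        nxt = i + stages
        if nxt < n_tiles:
            buffers[nxt % stages] = load_tile(data, nxt, tile)
    return total
```

With `stages=2` this is classic double buffering; a deeper pipeline (`stages=3` or more) is what "multi-stage" refers to, and applying the same structure at each level of the buffer hierarchy (global to shared, shared to registers) gives "multi-level".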