Deep learning models rely on highly optimized tensor libraries for efficient inference on heterogeneous hardware. Current deep learning compilers typically predetermine the layouts of tensors and then optimize the loops of operators. However, such a unidirectional and one-off workflow strictly separates graph-level optimization and operator-level optimization into different system layers, missing opportunities for unified tuning. This paper proposes ALT, a compiler that performs joint graph- and operator-level optimizations for deep models. ALT provides a generic transformation module to manipulate layouts and loops with easy-to-use primitive functions. ALT further integrates an auto-tuning module that jointly optimizes graph-level data layouts and operator-level loops while keeping the search efficient. Experimental results show that ALT significantly outperforms state-of-the-art compilers (e.g., Ansor) in terms of both single-operator performance (e.g., 1.5x speedup on average) and end-to-end inference performance (e.g., 1.4x speedup on average).
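To make the idea of joint graph- and operator-level tuning concrete, the following is a minimal illustrative sketch, not ALT's actual API: it enumerates the cross product of a graph-level choice (the storage layout of one operand) and an operator-level choice (the tile size of a blocked matrix multiplication) and picks the fastest combination by measurement. All names here (blocked_matmul, tune) are hypothetical and exist only for illustration.

    import itertools
    import time
    import numpy as np

    def blocked_matmul(A, B, tile):
        # Operator-level knob: the loop is blocked by `tile` along all three axes.
        n = A.shape[0]
        C = np.zeros((n, n), dtype=A.dtype)
        for i in range(0, n, tile):
            for j in range(0, n, tile):
                for k in range(0, n, tile):
                    C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
        return C

    def tune(n=512, layouts=("C", "F"), tiles=(64, 128, 256)):
        A = np.random.rand(n, n).astype(np.float32)
        B0 = np.random.rand(n, n).astype(np.float32)
        best = None
        for layout, tile in itertools.product(layouts, tiles):
            # Graph-level knob: store B row-major ("C") or column-major ("F").
            B = np.asarray(B0, order=layout)
            start = time.perf_counter()
            blocked_matmul(A, B, tile)
            cost = time.perf_counter() - start
            if best is None or cost < best[0]:
                best = (cost, layout, tile)
        return best

    if __name__ == "__main__":
        cost, layout, tile = tune()
        print(f"best: layout={layout}, tile={tile}, time={cost:.4f}s")

The point of the sketch is that the best layout depends on the loop schedule and vice versa, which is why tuning the two in separate, one-off passes can miss the jointly optimal configuration.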