As deep learning models are now widely deployed on both cloud services and edge devices, reducing inference latency has become crucial for efficient model serving. However, developing efficient tensor programs for deep learning operators is challenging because of the high complexity of modern accelerators and the rapidly growing number of operators. Deep learning compilers, such as Apache TVM, adopt declarative scheduling primitives to lower the bar for developing tensor programs. However, we show that this approach is insufficient to cover state-of-the-art tensor program optimizations. In this paper, we propose to embed the scheduling process into tensor programs and to use dedicated mappings, called task mappings, to define the assignment and ordering of computation. This new approach greatly enriches the set of expressible optimizations by allowing developers to manipulate tensor programs at a much finer granularity. We call the proposed method the task-mapping programming paradigm. In addition, we propose a new post-scheduling fusion optimization that lets developers focus on scheduling each operator individually and automates fusion after scheduling, which greatly reduces the engineering effort required for operator fusion. Our paradigm also constructs an efficient hardware-centric schedule space, which is agnostic to the program input size and greatly reduces tuning time. With the proposed paradigm, we implement a deep learning compiler, Hidet. Extensive experiments on modern convolution and transformer models show that Hidet outperforms the state-of-the-art DNN inference framework ONNX Runtime and the compiler TVM, equipped with the schedulers AutoTVM and Ansor, by up to 1.48x (1.22x on average). It also reduces tuning time by 20x and 11x compared with AutoTVM and Ansor, respectively. We have open-sourced Hidet at https://www.github.com/hidet-org/hidet.
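To make the idea of a task mapping concrete, the following is a minimal sketch in plain Python. It assumes a task mapping can be modeled as a function from a worker index to an ordered list of task coordinates, so that it fixes both which worker runs a task (assignment) and in what sequence (ordering). The `TaskMapping` class, the `spatial` and `repeat` helpers, and the composition rule are assumptions made for this illustration; the sketch does not use or describe Hidet's actual API.

```python
from itertools import product


class TaskMapping:
    """Hypothetical model: worker index -> ordered list of task coordinates."""

    def __init__(self, task_shape, num_workers, fn):
        self.task_shape = tuple(task_shape)  # shape of the task grid
        self.num_workers = num_workers       # number of workers sharing the grid
        self.fn = fn                         # worker index -> list of tasks

    def __call__(self, worker):
        return self.fn(worker)

    def __mul__(self, other):
        # Composition (an assumed rule for this sketch): `self` picks a tile
        # of the task grid and `other` picks a task inside that tile.
        assert len(self.task_shape) == len(other.task_shape)
        shape = tuple(a * b for a, b in zip(self.task_shape, other.task_shape))

        def fn(worker):
            outer_w, inner_w = divmod(worker, other.num_workers)
            return [
                tuple(o * d + i for o, i, d in zip(outer, inner, other.task_shape))
                for outer in self(outer_w)
                for inner in other(inner_w)
            ]

        return TaskMapping(shape, self.num_workers * other.num_workers, fn)


def spatial(m, n):
    """m * n workers, each assigned exactly one task of an (m, n) grid."""
    return TaskMapping((m, n), m * n, lambda w: [divmod(w, n)])


def repeat(m, n):
    """One worker that runs all tasks of an (m, n) grid in row-major order."""
    return TaskMapping((m, n), 1, lambda w: list(product(range(m), range(n))))


# Four workers cover a (4, 2) task grid; each worker runs two tasks in order.
mapping = repeat(2, 1) * spatial(2, 2)
for w in range(mapping.num_workers):
    print(w, mapping(w))
```

In this sketch, composing the two mappings yields a grid of 4 x 2 tasks split across four workers, with each worker visiting its two tasks in a fixed order. Changing the factors changes the assignment and ordering without touching the computation itself.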