As deep learning models are now widely deployed on both cloud services and edge devices, the latency of model inference becomes crucial for efficient model serving. However, it is challenging to develop efficient tensor programs for deep learning operators due to the high complexity of modern accelerators (e.g., NVIDIA GPUs and Google TPUs) and the rapidly growing number of operators. Deep learning compilers, such as Apache TVM, adopt declarative scheduling primitives to lower the bar for developing tensor programs. However, we show that this approach is insufficient to cover state-of-the-art tensor program optimizations (e.g., double buffering). In this paper, we propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering directly in the tensor programs. This new approach greatly enriches the expressible optimizations by allowing developers to manipulate tensor programs at a much finer granularity (e.g., allowing program-statement-level optimizations). We call the proposed method the task-mapping-oriented programming paradigm. With the proposed paradigm, we implement a deep learning compiler named Hidet. Extensive experiments on modern convolution and transformer models show that Hidet outperforms the state-of-the-art DNN inference framework ONNX Runtime and the compiler TVM (equipped with the schedulers AutoTVM and Ansor) by up to 1.48x (1.22x on average) with enriched optimizations. It also reduces the tuning time by 20x and 11x compared with AutoTVM and Ansor, respectively.
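To make the central idea concrete, the following is a minimal, self-contained Python sketch of what a task mapping computes: a function from each worker (e.g., a CUDA thread) to the ordered list of tasks it executes. The names Spatial and Repeat follow the paper's terminology for the two basic mappings and their composition; this is an illustrative sketch under those assumptions, not Hidet's actual API.

```python
from itertools import product
from typing import List, Tuple

Task = Tuple[int, ...]

class TaskMapping:
    """Maps each worker to the ordered tasks it executes (illustrative sketch)."""
    def __init__(self, task_shape: Tuple[int, ...], num_workers: int):
        self.task_shape = task_shape
        self.num_workers = num_workers

    def tasks_of(self, worker: int) -> List[Task]:
        raise NotImplementedError

    def __mul__(self, other: "TaskMapping") -> "TaskMapping":
        return Composed(self, other)

class Spatial(TaskMapping):
    """One task per worker: task (i, j) runs on worker i * ncols + j (row-major)."""
    def __init__(self, *shape: int):
        n = 1
        for s in shape:
            n *= s
        super().__init__(tuple(shape), num_workers=n)

    def tasks_of(self, worker: int) -> List[Task]:
        task = []
        for s in reversed(self.task_shape):
            task.append(worker % s)
            worker //= s
        return [tuple(reversed(task))]

class Repeat(TaskMapping):
    """A single worker sequentially executes the whole task grid."""
    def __init__(self, *shape: int):
        super().__init__(tuple(shape), num_workers=1)

    def tasks_of(self, worker: int) -> List[Task]:
        return list(product(*(range(s) for s in self.task_shape)))

class Composed(TaskMapping):
    """Composition: the outer mapping picks a block, the inner mapping fills it."""
    def __init__(self, outer: TaskMapping, inner: TaskMapping):
        shape = tuple(a * b for a, b in zip(outer.task_shape, inner.task_shape))
        super().__init__(shape, outer.num_workers * inner.num_workers)
        self.outer, self.inner = outer, inner

    def tasks_of(self, worker: int) -> List[Task]:
        outer_w, inner_w = divmod(worker, self.inner.num_workers)
        return [
            tuple(o * s + i for o, i, s in zip(ot, it, self.inner.task_shape))
            for ot in self.outer.tasks_of(outer_w)
            for it in self.inner.tasks_of(inner_w)
        ]

# Usage: 8 workers cover a 4x8 task grid; each worker owns a 2x2 block.
mapping = Spatial(2, 4) * Repeat(2, 2)
for w in range(mapping.num_workers):
    print(w, mapping.tasks_of(w))
```

Because the mapping is an ordinary object in the tensor program rather than a declarative schedule applied from outside, the developer controls both which worker computes each task and the order of tasks within a worker, which is the fine granularity (e.g., statement-level placement) that the paradigm exploits.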