Today's auto-tuners (e.g., AutoTVM, Ansor) generate efficient tensor programs by navigating a large search space to identify effective implementations, but they treat hardware details as opaque. As a result, their performance can fall behind that of hardware-native libraries (e.g., cuBLAS, cuDNN), which device vendors hand-optimize to extract high performance. These vendor libraries, on the other hand, support only a fixed set of functions and lack the customization and automation that auto-tuners afford. Bolt builds on the recent trend of vendor libraries becoming increasingly modularized and reconfigurable via declarative control (e.g., CUTLASS). It enables a novel approach, hardware-native templated search, that bridges this gap and achieves the best of both worlds. Bolt also provides new opportunities to rethink end-to-end tensor optimizations at the graph, operator, and model levels. We demonstrate this concept by prototyping Bolt on a popular auto-tuner in TVM and a class of widely used platforms (i.e., NVIDIA GPUs), both of which are widely deployed in our production environment. Bolt improves the inference speed of common convolutional neural networks by 2.5x on average over the state of the art, and it auto-tunes these models within 20 minutes.