During the past decade, novel Deep Learning (DL) algorithms, workloads, and hardware have been developed to tackle a wide range of problems. Despite the advances in the workload/hardware ecosystem, the programming methodology of DL systems has remained stagnant. DL workloads leverage either highly optimized, yet platform-specific and inflexible kernels from DL libraries, or, in the case of novel operators, reference implementations built via DL framework primitives with underwhelming performance. This work introduces the Tensor Processing Primitives (TPP), a programming abstraction striving for efficient, portable implementation of DL workloads with high productivity. TPPs define a compact, yet versatile set of 2D-tensor operators (or a virtual Tensor ISA), which subsequently can be utilized as building blocks to construct complex operators on high-dimensional tensors. The TPP specification is platform-agnostic, thus code expressed via TPPs is portable, whereas the TPP implementation is highly optimized and platform-specific. We demonstrate the efficacy of our approach using standalone kernels and end-to-end DL workloads expressed entirely via TPPs that outperform state-of-the-art implementations on multiple platforms.
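To make the composition idea concrete, the following is a minimal, illustrative sketch (in plain NumPy, not the TPP specification or the LIBXSMM API): a few hypothetical 2D-block primitives (`gemm_2d`, `binary_add_2d`, `unary_relu_2d`) are composed by outer loops into a higher-dimensional operator, here a blocked fully-connected forward pass with fused bias and ReLU. All function names and blocking parameters are assumptions chosen for illustration only.

```python
# Minimal sketch (NOT the TPP/LIBXSMM API): small 2D-tensor primitives,
# acting as a "virtual Tensor ISA", composed into a DL operator on
# higher-dimensional data via simple blocking loops.
import numpy as np

# --- hypothetical 2D primitives (stand-ins for platform-specific TPP kernels) ---
def gemm_2d(a, b, c):
    """C += A @ B on small 2D blocks (BRGEMM-style accumulation step)."""
    c += a @ b

def binary_add_2d(x, y, out):
    """out = x + y on a 2D block (e.g., bias addition)."""
    np.add(x, y, out=out)

def unary_relu_2d(x, out):
    """out = max(x, 0) on a 2D block (activation)."""
    np.maximum(x, 0.0, out=out)

# --- operator built purely from the 2D primitives above ---
def fc_forward_blocked(inp, wt, bias, bn, bc, bk):
    """Blocked fully-connected layer: inp [N, C], wt [C, K], bias [K] -> out [N, K]."""
    N, C = inp.shape
    K = wt.shape[1]
    out = np.zeros((N, K), dtype=inp.dtype)
    for n0 in range(0, N, bn):            # outer loops over logical 2D blocks
        for k0 in range(0, K, bk):
            acc = np.zeros((min(bn, N - n0), min(bk, K - k0)), dtype=inp.dtype)
            for c0 in range(0, C, bc):    # reduction over the C dimension
                gemm_2d(inp[n0:n0+bn, c0:c0+bc], wt[c0:c0+bc, k0:k0+bk], acc)
            binary_add_2d(acc, bias[k0:k0+bk], acc)        # fuse bias on the block
            unary_relu_2d(acc, out[n0:n0+bn, k0:k0+bk])    # fuse activation on the block
    return out

# usage: compare against an unblocked reference
rng = np.random.default_rng(0)
x = rng.standard_normal((128, 256))
w = rng.standard_normal((256, 64))
b = rng.standard_normal(64)
ref = np.maximum(x @ w + b, 0.0)
assert np.allclose(fc_forward_blocked(x, w, b, bn=32, bc=64, bk=32), ref)
```

In an actual TPP-based implementation, each 2D primitive would be a JIT-generated, platform-specific kernel, while the outer blocking loops and fusion structure (the part shown here in Python) remain platform-agnostic.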