During the past decade, novel Deep Learning (DL) algorithms, workloads, and hardware have been developed to tackle a wide range of problems. Despite the advances in the workload and hardware ecosystems, the programming methodology of DL systems is stagnant. DL workloads leverage either highly optimized, yet platform-specific and inflexible kernels from DL libraries, or, in the case of novel operators, reference implementations built via DL-framework primitives that suffer from underwhelming performance. This work introduces the Tensor Processing Primitives (TPP), a programming abstraction striving for efficient, portable implementation of DL workloads with high productivity. TPPs define a compact, yet versatile set of 2D tensor operators (a virtual tensor ISA), which can subsequently be used as building blocks to construct complex operators on high-dimensional tensors. The TPP specification is platform-agnostic, so code expressed via TPPs is portable, whereas the TPP implementation is highly optimized and platform-specific. We demonstrate the efficacy of our approach using standalone kernels and end-to-end DL workloads expressed entirely via TPPs, which outperform state-of-the-art implementations on multiple platforms.
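To make the composition model concrete, below is a minimal, hypothetical sketch in plain C: two 2D primitives, a batch-reduce GEMM and a ReLU, are composed into a fused fully-connected-plus-ReLU operator over blocked, higher-dimensional tensors. The names tpp_brgemm and tpp_relu and the blocked layout are invented here for illustration and are not the actual TPP specification; the scalar loop nests merely stand in for the highly optimized, platform-specific kernels a real TPP implementation would generate.

/* Sketch only: 2D tensor primitives composed into a complex operator.
 * tpp_brgemm / tpp_relu are illustrative names, not the real TPP API. */
#include <stdio.h>
#include <stdlib.h>

/* Batch-reduce GEMM primitive: C += sum_b A_b * B_b over a batch of
 * M x K and K x N blocks addressed via strides. */
static void tpp_brgemm(int M, int N, int K, int batch,
                       const float *A, long strideA,
                       const float *B, long strideB, float *C) {
    for (int b = 0; b < batch; ++b)
        for (int m = 0; m < M; ++m)
            for (int n = 0; n < N; ++n)
                for (int k = 0; k < K; ++k)
                    C[m * N + n] += A[b * strideA + m * K + k]
                                  * B[b * strideB + k * N + n];
}

/* Unary 2D primitive: in-place ReLU on an M x N block. */
static void tpp_relu(int M, int N, float *X) {
    for (int i = 0; i < M * N; ++i)
        X[i] = X[i] > 0.0f ? X[i] : 0.0f;
}

int main(void) {
    /* Blocked 4D layouts: A[Mb][Kb][bm][bk], B[Kb][Nb][bk][bn],
     * C[Mb][Nb][bm][bn]. */
    enum { Mb = 2, Nb = 2, Kb = 4, bm = 8, bn = 8, bk = 8 };
    float *A = malloc(sizeof(float) * Mb * Kb * bm * bk);
    float *B = malloc(sizeof(float) * Kb * Nb * bk * bn);
    float *C = calloc((size_t)Mb * Nb * bm * bn, sizeof(float));
    for (int i = 0; i < Mb * Kb * bm * bk; ++i) A[i] = 1.0f;
    for (int i = 0; i < Kb * Nb * bk * bn; ++i) B[i] = 0.5f;

    /* A fused fully-connected + ReLU operator expressed purely as a
     * composition of the two 2D primitives above. */
    for (int mb = 0; mb < Mb; ++mb)
        for (int nb = 0; nb < Nb; ++nb) {
            float *Cblk = &C[((size_t)mb * Nb + nb) * bm * bn];
            tpp_brgemm(bm, bn, bk, Kb,
                       &A[(size_t)mb * Kb * bm * bk], (long)bm * bk,
                       &B[(size_t)nb * bk * bn], (long)Nb * bk * bn,
                       Cblk);
            tpp_relu(bm, bn, Cblk);
        }
    printf("C[0] = %.1f\n", C[0]); /* expect 0.5 * bk * Kb = 16.0 */
    free(A); free(B); free(C);
    return 0;
}

The point of the sketch is the separation of concerns the abstract describes: the outer operator only sequences platform-agnostic 2D primitives over tensor blocks, so replacing the primitive bodies with platform-specific kernels changes the performance, not the operator code.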