Many of today's deep neural network accelerators, e.g., Google's TPU and NVIDIA's Tensor Cores, are built around accelerating general matrix multiplication (GEMM). However, supporting convolution on GEMM-based accelerators is not trivial. The naive method explicitly lowers the convolution to GEMM, commonly known as im2col, which introduces significant performance and memory overhead. Existing implicit im2col algorithms require unscalable hardware and are inefficient in supporting important convolution variants such as strided convolution. In this paper, we propose a memory-efficient and hardware-friendly implicit im2col algorithm used by Google's TPU, which dynamically converts a convolution into a GEMM with practically zero performance and memory overhead, fully unleashing the power of GEMM engines. Through comprehensive experimental results, we quantitatively argue that this algorithm has been adopted in commercial closed-source platforms, and we are the first to describe its high-level idea and implementation details. Finally, we show that our algorithm can also be generally applied to NVIDIA's Tensor Cores (TC), matching and even outperforming the measured performance on TCs.
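To make the overhead of the naive lowering concrete, the following is a minimal sketch (not from the paper) of explicit im2col in NumPy. The layout (N, H, W, C inputs and R, S, C, K filters), the function name, and the valid-padding assumption are illustrative choices, not the paper's implementation; the point is that the lowered matrix duplicates each input element up to R*S times before a single GEMM is issued.

```python
# Illustrative sketch of explicit im2col lowering of a convolution to GEMM.
# All names and shapes here are assumptions for the example, not the paper's code.
import numpy as np

def im2col_conv(x, w, stride=1):
    """x: input (N, H, W, C); w: filters (R, S, C, K); valid padding."""
    N, H, W, C = x.shape
    R, S, _, K = w.shape
    OH = (H - R) // stride + 1
    OW = (W - S) // stride + 1
    # Explicitly materialize every receptive field as one row:
    # the lowered matrix has N*OH*OW rows of length R*S*C, so each input
    # element is copied up to R*S times -- the memory overhead the text refers to.
    cols = np.empty((N * OH * OW, R * S * C), dtype=x.dtype)
    row = 0
    for n in range(N):
        for i in range(OH):
            for j in range(OW):
                patch = x[n, i*stride:i*stride+R, j*stride:j*stride+S, :]
                cols[row] = patch.reshape(-1)
                row += 1
    # A single large GEMM then replaces the convolution.
    out = cols @ w.reshape(R * S * C, K)      # (N*OH*OW, K)
    return out.reshape(N, OH, OW, K)

# Tiny usage example; a strided convolution lowers to GEMM the same way.
x = np.random.rand(1, 8, 8, 3).astype(np.float32)
w = np.random.rand(3, 3, 3, 16).astype(np.float32)
y = im2col_conv(x, w, stride=2)
print(y.shape)   # (1, 3, 3, 16)
```

An implicit im2col scheme, by contrast, generates these patch addresses on the fly while feeding the GEMM engine, so the lowered matrix is never materialized in memory.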