Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ only in the core operation; for example, using add-minimum instead of multiply-add. Algorithms with a semiring-like structure therefore have the potential to be accelerated by a general-purpose matrix operation architecture, instead of common MXUs. In this paper, we propose SIMD$^2$, a new programming paradigm to support generalized matrix operations with a semiring-like structure. SIMD$^2$ instructions accelerate eight more types of matrix operations, in addition to matrix multiplications. Since SIMD$^2$ instructions resemble a matrix-multiplication instruction, we are able to build the SIMD$^2$ architecture on top of any MXU architecture with minimal modifications. We developed a framework that emulates and validates SIMD$^2$ using NVIDIA GPUs with Tensor Cores. Across 8 applications, SIMD$^2$ provides up to 38.59$\times$ speedup and more than 10.63$\times$ on average over optimized CUDA programs, with only 5% of full-chip area overhead.
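To make the "add-minimum instead of multiply-add" point concrete: replacing the $(+, \times)$ semiring of standard GEMM with the $(\min, +)$ tropical semiring yields the min-plus matrix product used in all-pairs shortest-path relaxation, while the loop structure and tiling opportunities stay identical. The following is a minimal, hypothetical CUDA sketch (kernel name and layout are illustrative, not the SIMD$^2$ implementation) showing that only the core operation and its identity element change relative to a naive GEMM kernel.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

// Illustrative min-plus "matrix product": C[i][j] = min_k (A[i][k] + B[k][j]).
// The loop nest mirrors a standard GEMM; only the inner operation
// (add-minimum instead of multiply-add) and the accumulator's identity
// element (FLT_MAX instead of 0) differ.
__global__ void minplus_mm(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float acc = FLT_MAX;  // identity of min, analogous to 0 for addition
    for (int k = 0; k < N; ++k) {
        float cand = A[row * N + k] + B[k * N + col];  // "multiply" becomes +
        acc = fminf(acc, cand);                        // "add" becomes min
    }
    C[row * N + col] = acc;
}
```

Because the data-flow graph is unchanged, the same blocking and reuse strategies that make GEMM fast on an MXU apply directly once the hardware exposes the generalized core operation.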