Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication

Sparse-matrix dense-matrix multiplication (SpMM) is a key operator for a wide range of applications, including scientific computing, graph processing, and deep learning. Architecting accelerators for SpMM faces three challenges: (1) random memory access and unbalanced processing load caused by the random distribution of non-zero elements in sparse matrices, (2) inefficient handling of large matrices that cannot fit on-chip, and (3) non-general-purpose accelerator designs in which one accelerator can only process a fixed-size problem. In this paper, we present Sextans, an accelerator for general-purpose SpMM processing. Sextans features (1) fast random access using on-chip memory, (2) streaming access to large off-chip matrices, (3) PE-aware non-zero scheduling for a balanced workload with an II=1 pipeline, and (4) hardware flexibility that allows the hardware to be prototyped once and then support SpMMs of different sizes as a general-purpose accelerator. We leverage high bandwidth memory (HBM) for efficient access to both sparse and dense matrices. In the evaluation, we present an FPGA prototype, Sextans, which is executable on a Xilinx U280 HBM FPGA board, and a projected prototype, Sextans-P, with higher bandwidth comparable to V100 and more frequency optimization. We conduct a comprehensive evaluation on 1,400 SpMMs on a wide range of sparse matrices, including 50 matrices from SNAP and 150 from SuiteSparse. We compare Sextans with NVIDIA K80 and V100 GPUs. Sextans achieves a 2.50x geomean speedup over the K80 GPU, and Sextans-P achieves a 1.14x geomean speedup over the V100 GPU (4.94x over K80).
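
To make the SpMM operation concrete, the following is a minimal software sketch, not the Sextans hardware design. It computes C = A × B, with the sparse matrix A given as (row, column, value) triples, and groups the non-zeros into per-PE queues before processing. The function name `spmm_coo`, the `num_pes` parameter, and the row-modulo partitioning rule are assumptions made for illustration only; the abstract does not specify how the PE-aware scheduling assigns non-zeros.

```python
from collections import defaultdict


def spmm_coo(nonzeros, B, num_rows, num_pes=4):
    """Compute C = A @ B, where A is a list of (row, col, value) non-zeros.

    B is a dense matrix stored as a list of rows. The per-PE grouping below
    is a hypothetical stand-in for a hardware scheduler.
    """
    n_cols = len(B[0])
    C = [[0.0] * n_cols for _ in range(num_rows)]

    # Assumed partitioning rule: non-zeros of row i go to PE (i % num_pes),
    # so each PE owns a disjoint set of output rows.
    pe_queues = defaultdict(list)
    for r, c, v in nonzeros:
        pe_queues[r % num_pes].append((r, c, v))

    # Each PE consumes its queue; in software we simply iterate over PEs.
    for pe in range(num_pes):
        for r, c, v in pe_queues[pe]:
            b_row = B[c]  # the only dense row this non-zero touches
            for j in range(n_cols):
                C[r][j] += v * b_row[j]
    return C


# A is 3x3 with four non-zeros; B is a 3x2 dense matrix.
A_nonzeros = [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 3.0), (2, 0, 4.0)]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = spmm_coo(A_nonzeros, B, num_rows=3)
print(C)  # [[7.0, 10.0], [9.0, 12.0], [4.0, 8.0]]
```

Because each output row belongs to exactly one queue in this sketch, the queues never write to the same row of C. The irregular part of the computation is the lookup `B[c]`, which depends on the column index of each non-zero and corresponds to the random access pattern the abstract names as the first challenge.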
