The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory costs. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. However, it is very challenging to gain practical speedups from sparse, low-precision matrix operations on Tensor cores, because of the strict requirements on data layout and the lack of support for efficiently manipulating low-precision integers. We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores. Magicube supports SpMM and SDDMM, two major sparse operations in deep learning, with mixed precision. Experimental results on an NVIDIA A100 GPU show that Magicube achieves on average a 1.44x (up to 2.37x) speedup over the vendor-optimized library for sparse kernels, and a 1.43x speedup over the state of the art with comparable accuracy for end-to-end sparse Transformer inference.
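For reference, SpMM (sparse matrix times dense matrix) and SDDMM (sampled dense-dense matrix multiplication) have simple mathematical semantics that a minimal NumPy sketch can capture. The function names below and the choice of int8 operands with int32 accumulation (mimicking Tensor-core integer MMA) are illustrative assumptions, not Magicube's actual API; a real kernel would operate on a compressed sparse format rather than dense arrays.

```python
import numpy as np

def spmm(A_sparse, B):
    """SpMM: C = A @ B, where A is sparse.
    Dense reference semantics; a production kernel would use a
    compressed layout (e.g., CSR) and skip the zero entries of A."""
    # int8 operands accumulated in int32, mimicking Tensor-core
    # integer MMA (illustrative assumption, not Magicube's API).
    return A_sparse.astype(np.int32) @ B.astype(np.int32)

def sddmm(S_mask, A, B):
    """SDDMM: C = S .* (A @ B^T), a dense-dense product sampled
    at the nonzero positions of a sparsity mask S."""
    return S_mask * (A.astype(np.int32) @ B.astype(np.int32).T)

# Tiny usage example with low-precision integer operands.
rng = np.random.default_rng(0)
A = (rng.integers(-8, 8, (4, 6)) * rng.integers(0, 2, (4, 6))).astype(np.int8)
B = rng.integers(-8, 8, (6, 3)).astype(np.int8)
C = spmm(A, B)                     # (4, 3) int32 result

mask = rng.integers(0, 2, (4, 4))  # sparsity pattern of the output
D = sddmm(mask,
          rng.integers(-8, 8, (4, 6)).astype(np.int8),
          rng.integers(-8, 8, (4, 6)).astype(np.int8))
```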