Tensor accelerators have gained popularity because they provide a cheap and efficient solution for speeding up computationally expensive tasks in Deep Learning and, more recently, in other Scientific Computing applications. However, since their features are specifically designed for tensor algebra (typically dense matrix products), it is commonly assumed that they are not suitable for applications with sparse data. To challenge this viewpoint, we discuss methods and present solutions for accelerating sparse matrix multiplication on such architectures. In particular, we present a 1-dimensional blocking algorithm with theoretical guarantees on the density, which builds dense blocks from arbitrary sparse matrices. Experimental results show that, even for unstructured and highly sparse matrices, our block-based solution, which exploits Nvidia Tensor Cores, is faster than its sparse counterpart. We observed significant speed-ups of up to two orders of magnitude on real-world sparse matrices.
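To make the 1-dimensional blocking idea concrete, the following is a minimal illustrative sketch, not the algorithm with density guarantees presented in this work: it greedily groups rows whose nonzero-column patterns overlap sufficiently, so that each group indexes a reasonably dense submatrix that a dense engine (e.g., a Tensor Core GEMM) can consume. The Jaccard-similarity grouping and the threshold `tau` are assumptions made purely for illustration.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def greedy_row_blocks(A_csr, tau=0.5):
    """Greedily group rows of a CSR matrix whose nonzero-column
    patterns have Jaccard similarity >= tau into 1-D blocks.
    Illustrative heuristic only, not the paper's exact algorithm."""
    n_rows = A_csr.shape[0]
    # Nonzero column indices of each row, as sets.
    patterns = [set(A_csr.indices[A_csr.indptr[i]:A_csr.indptr[i + 1]])
                for i in range(n_rows)]
    blocks, assigned = [], [False] * n_rows
    for i in range(n_rows):
        if assigned[i] or not patterns[i]:
            continue
        group, cols = [i], set(patterns[i])
        assigned[i] = True
        for j in range(i + 1, n_rows):
            if assigned[j] or not patterns[j]:
                continue
            inter = len(cols & patterns[j])
            union = len(cols | patterns[j])
            if inter / union >= tau:
                group.append(j)       # row j joins the block
                cols |= patterns[j]   # block's column support grows
                assigned[j] = True
        blocks.append((group, sorted(cols)))
    return blocks

# Each (rows, cols) pair indexes a dense submatrix suitable for a
# dense multiplication kernel instead of a sparse (SpMM) one.
A = sparse_random(64, 64, density=0.05, format="csr", random_state=0)
for rows, cols in greedy_row_blocks(A, tau=0.3):
    dense_block = A[rows][:, cols].toarray()  # dense tile for GEMM
```

In practice one would also pad each extracted block to the tile sizes Tensor Cores operate on (multiples of 16 in current generations) and batch the per-block dense multiplications; the sketch stops at block extraction.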