Reducing the computational cost of running large-scale neural networks using sparsity has attracted great attention in the deep learning community. While much success has been achieved in reducing FLOP and parameter counts while maintaining acceptable task performance, achieving actual speed improvements has typically been much more difficult, particularly on general-purpose accelerators (GPAs) such as NVIDIA GPUs using low-precision number formats. In this work we introduce PopSparse, a library that enables fast sparse operations on Graphcore IPUs by leveraging both the unique hardware characteristics of IPUs and any block structure defined in the data. We target two different types of sparsity: static, where the sparsity pattern is fixed at compile time; and dynamic, where it can change each time the model is run. We present benchmark results for matrix multiplication in both of these modes on IPU across a range of block sizes, matrix sizes and densities. Results indicate that the PopSparse implementations are faster than dense matrix multiplication on IPU across a range of sparsity levels when the matrix size and block size are large. Furthermore, static sparsity generally outperforms dynamic sparsity. While previous work on GPAs has shown speedups only at very high sparsity (typically 99\% and above), the present work demonstrates that our static sparse implementation outperforms equivalent dense calculations in FP16 at lower sparsity (around 90\%). IPU code is available to view and run at ipu.dev/sparsity-benchmarks; GPU code will be made available shortly.
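To make the block-sparse setting concrete, the following is a minimal illustrative sketch (not the PopSparse API; all names are hypothetical) of the semantics benchmarked above: a matrix multiplication in which the sparse operand is stored as a collection of non-zero blocks of a fixed block size, multiplied against a dense right-hand side and checked against the equivalent dense computation.

```python
# Illustrative sketch only: NOT the PopSparse API. Shows the semantics of a
# block-sparse matrix multiply y = A @ x, where A is stored as its non-zero
# blocks (BSR-style), as in the static/dynamic sparsity modes described above.
import numpy as np

def block_sparse_matmul(block_values, block_row_idx, block_col_idx,
                        shape, block_size, x):
    """Multiply a block-sparse matrix A, given by its non-zero blocks,
    with a dense matrix x.

    block_values  : (nnz_blocks, block_size, block_size) block contents
    block_row_idx : block-row coordinate of each non-zero block
    block_col_idx : block-column coordinate of each non-zero block
    shape         : (rows, cols) of the full matrix A (multiples of block_size)
    x             : dense right-hand side of shape (cols, n)
    """
    rows, cols = shape
    out = np.zeros((rows, x.shape[1]), dtype=x.dtype)
    for vals, br, bc in zip(block_values, block_row_idx, block_col_idx):
        r0, c0 = br * block_size, bc * block_size
        # Each non-zero block contributes a small dense matmul to its block row.
        out[r0:r0 + block_size] += vals @ x[c0:c0 + block_size]
    return out

# Tiny example: a 4x4 block-diagonal matrix (50% block density, 2x2 blocks).
bs = 2
blocks = np.array([[[1., 2.], [3., 4.]],    # non-zero block at block (0, 0)
                   [[5., 6.], [7., 8.]]])   # non-zero block at block (1, 1)
brow, bcol = [0, 1], [0, 1]
x = np.random.rand(4, 3)

dense = np.zeros((4, 4))
dense[0:2, 0:2] = blocks[0]
dense[2:4, 2:4] = blocks[1]

assert np.allclose(block_sparse_matmul(blocks, brow, bcol, (4, 4), bs, x),
                   dense @ x)
```

In the static case the block coordinates would be known at compile time, allowing the schedule of these small dense multiplications to be fixed ahead of execution; in the dynamic case they may change on every run.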