Exploiting the sparsity underlying neural networks has become one of the most promising approaches to reducing the memory footprint, I/O cost, and computation workload during inference. Moreover, the degree of sparsity that can be exploited has grown as ever-larger model sizes are adopted, following the trend of pre-training giant models. On the other hand, compared with quantization, which is a widely supported option, acceleration through high-degree sparsity is not supported on most computing platforms. In this work, we introduce S4, the first commercial hardware platform supporting high-degree sparsity acceleration of up to 32 times. Combined with state-of-the-art sparse pruning techniques, we demonstrate a practical inference speedup of several times on S4 over mainstream inference platforms such as the Nvidia T4. We also show that, in practice, a sparse model of larger size can achieve both higher accuracy and higher throughput on S4 than a dense model of smaller size.