Recent works have explored the use of weight sparsity to improve the training efficiency (test accuracy w.r.t. training FLOPs) of deep neural networks (DNNs). These works aim to reduce training FLOPs, but training with sparse weights often sacrifices accuracy or requires longer training schedules, making the resulting gains in training efficiency unclear. In contrast, we focus on using sparsity to increase accuracy while using the same FLOPs as the dense model, and we demonstrate training-efficiency gains through this higher accuracy. In this work, we introduce SIFT, a family of Sparse Iso-FLOP Transformations that serve as drop-in replacements for dense layers and improve their representational capacity and FLOP efficiency. Each transformation is parameterized by a single hyperparameter, the sparsity level, and provides a larger search space for finding optimal sparse masks. Without changing any training hyperparameters, replacing dense layers with SIFT leads to significant improvements across computer vision (CV) and natural language processing (NLP) tasks, including ResNet-18 on ImageNet (+3.5%) and GPT-3 Small on WikiText-103 (-0.4 PPL), both matching larger dense model variants with 2x or more FLOPs. To the best of our knowledge, this is the first work to demonstrate the use of sparsity for improving the accuracy of dense models via a simple-to-use set of sparse transformations. Code is available at: https://github.com/CerebrasResearch/SIFT.
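To make the iso-FLOP idea concrete, the following is a minimal, hypothetical sketch (not the authors' exact construction; see the repository for the actual implementation). It assumes one simple way to keep FLOPs constant: at sparsity s, a single sparse matmul costs roughly (1 - s) of the dense FLOPs, so summing k = round(1 / (1 - s)) sparse branches of the same shape matches the dense layer's FLOPs while enlarging the mask search space. The class name `SparseIsoFLOPLinear` and the use of static random masks are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseIsoFLOPLinear(nn.Module):
    """Illustrative drop-in replacement for nn.Linear (hypothetical sketch).

    With sparsity s, each masked branch costs ~(1 - s) of the dense FLOPs,
    so k = round(1 / (1 - s)) branches keep total FLOPs roughly equal to
    the dense layer. Masks here are static and random; in practice they
    would be optimized (e.g., with dynamic sparse training), and realizing
    the FLOP savings requires sparse compute support.
    """

    def __init__(self, in_features, out_features, sparsity=0.75, bias=True):
        super().__init__()
        self.k = max(1, round(1.0 / (1.0 - sparsity)))
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.empty(out_features, in_features)) for _ in range(self.k)]
        )
        for i, w in enumerate(self.weights):
            nn.init.kaiming_uniform_(w, a=5 ** 0.5)
            # Random binary mask keeping a (1 - sparsity) fraction of weights.
            self.register_buffer(f"mask_{i}", (torch.rand_like(w) > sparsity).float())
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x):
        # Sum the k masked (sparse) branches; iso-FLOP with the dense layer.
        out = 0
        for i, w in enumerate(self.weights):
            out = out + F.linear(x, w * getattr(self, f"mask_{i}"))
        if self.bias is not None:
            out = out + self.bias
        return out
```

For example, `SparseIsoFLOPLinear(768, 768, sparsity=0.75)` would stand in for `nn.Linear(768, 768)` with k = 4 sparse branches, so the input and output dimensions of the surrounding network are unchanged.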