Top-KAST: Top-K Always Sparse Training

Sparse neural networks are becoming increasingly important as the field seeks to improve the performance of existing models by scaling them up, while simultaneously trying to reduce power consumption and computational footprint. Unfortunately, most existing methods for inducing performant sparse models still entail the instantiation of dense parameters, or dense gradients in the backward-pass, during training. For very large models this requirement can be prohibitive. In this work we propose Top-KAST, a method that preserves constant sparsity throughout training (in both the forward and backward-passes). We demonstrate the efficacy of our approach by showing that it performs comparably to or better than previous works when training models on the established ImageNet benchmark, whilst fully maintaining sparsity. In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling where the current best performing architectures tend to have tens of billions of parameters and scaling up does not yet seem to have saturated performance. Sparse versions of these architectures can be run with significantly fewer resources, making them more widely accessible and applicable. Furthermore, in addition to being effective, our approach is straightforward and can easily be implemented in a wide range of existing machine learning frameworks with only a few additional lines of code. We therefore hope that our contribution will help enable the broader community to explore the potential held by massive models, without incurring massive computational cost.
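
The abstract does not spell out the mechanism, so the following is only a minimal sketch of what "always sparse" training could look like. It assumes the forward pass uses the largest-magnitude fraction of the weights and that gradient updates are restricted to a somewhat larger top-K set. The names `topk_mask`, `sparse_step`, `fwd_density`, `bwd_density`, and `grad_fn` are hypothetical, and the toy computes a dense gradient before masking it, which a genuinely sparse implementation would avoid.

```python
import numpy as np


def topk_mask(weights: np.ndarray, density: float) -> np.ndarray:
    """Boolean mask keeping the `density` fraction of entries with largest |w|."""
    k = max(1, int(round(density * weights.size)))
    flat = np.abs(weights).ravel()
    # argpartition puts the indices of the k largest magnitudes in the last k slots.
    keep = np.argpartition(flat, flat.size - k)[flat.size - k:]
    mask = np.zeros(flat.size, dtype=bool)
    mask[keep] = True
    return mask.reshape(weights.shape)


def sparse_step(weights, grad_fn, fwd_density=0.1, bwd_density=0.2, lr=0.01):
    """One illustrative update (assumed scheme, not the authors' reference code)."""
    fwd_mask = topk_mask(weights, fwd_density)   # weights used in the forward pass
    bwd_mask = topk_mask(weights, bwd_density)   # weights allowed to receive updates
    # Gradient of the loss with respect to the sparse weights actually used.
    grad = grad_fn(weights * fwd_mask)
    # Toy simplification: the gradient is dense here and masked afterwards.
    return weights - lr * grad * bwd_mask
```

Under these assumptions both masks keep a fixed number of entries at every step, so the fraction of active weights stays constant throughout training. Letting the update set be larger than the forward set gives weights outside the forward set a chance to grow and enter it later.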
