In this paper, we introduce a new perspective on training deep neural networks capable of state-of-the-art performance without the need for expensive over-parameterization, by proposing the concept of In-Time Over-Parameterization (ITOP) in sparse training. By starting from a random sparse network and continuously exploring sparse connectivities during training, we can perform an over-parameterization over the space-time manifold, closing the gap in expressibility between sparse training and dense training. We further use ITOP to understand the underlying mechanism of Dynamic Sparse Training (DST) and show that the benefits of DST come from its ability to consider, across time, all possible parameters when searching for the optimal sparse connectivity. As long as sufficiently many parameters have been reliably explored during training, DST can outperform the dense neural network by a large margin. We present a series of experiments to support our conjecture and achieve state-of-the-art sparse training performance with ResNet-50 on ImageNet. More impressively, our method achieves dominant performance over the over-parameterization-based sparse methods at extreme sparsity levels. When trained on CIFAR-100, our method can match the performance of the dense model even at an extreme sparsity of 98%. Code can be found at https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization.
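The connectivity exploration described above can be illustrated with a minimal prune-and-regrow sketch. This is not the authors' exact implementation; the function name `update_mask` and the magnitude-prune/random-grow rule are illustrative assumptions in the spirit of DST methods, shown only to convey how more parameters get explored "in time" than are active at any single moment.

```python
import numpy as np

# Illustrative sketch (assumption): one prune-and-regrow step of dynamic sparse
# training. A fixed fraction of the smallest-magnitude active weights is dropped,
# and the same number of currently inactive connections is re-activated at random,
# so that, over many such steps, far more parameters are explored across time than
# the sparse network holds at any single moment.
def update_mask(weights, mask, drop_fraction=0.3, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    active = np.flatnonzero(mask)
    n_drop = int(drop_fraction * active.size)
    if n_drop == 0:
        return mask
    # Prune: deactivate the n_drop active weights with the smallest magnitude.
    flat_w = weights.ravel()
    drop_idx = active[np.argsort(np.abs(flat_w[active]))[:n_drop]]
    new_mask = mask.copy().ravel()
    new_mask[drop_idx] = 0
    # Grow: activate the same number of randomly chosen inactive connections.
    inactive = np.flatnonzero(new_mask == 0)
    grow_idx = rng.choice(inactive, size=n_drop, replace=False)
    new_mask[grow_idx] = 1
    flat_w[grow_idx] = 0.0  # newly grown weights start from zero
    return new_mask.reshape(mask.shape)

# Example: a 90%-sparse layer whose connectivity is repeatedly re-sampled,
# tracking how many distinct parameters have been explored over time.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
mask = (rng.random((64, 64)) < 0.1).astype(np.int8)
explored = mask.astype(bool).copy()
for _ in range(100):
    mask = update_mask(w, mask, drop_fraction=0.3, rng=rng)
    explored |= mask.astype(bool)
print("active at once:", mask.sum(), "explored across time:", explored.sum())
```

Under this toy setting, the number of parameters ever activated grows well beyond the instantaneous parameter count, which is the quantity the paper argues must be sufficiently large for DST to match or exceed dense training.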