In this paper, we introduce a new perspective on training deep neural networks that achieves state-of-the-art performance without expensive over-parameterization, by proposing the concept of In-Time Over-Parameterization (ITOP) in sparse training. By starting from a random sparse network and continuously exploring sparse connectivities during training, we perform over-parameterization in the space-time manifold, closing the gap in expressibility between sparse training and dense training. We further use ITOP to understand the underlying mechanism of Dynamic Sparse Training (DST) and show that the benefits of DST come from its ability to consider, across time, all possible parameters when searching for the optimal sparse connectivity. As long as sufficiently many parameters have been reliably explored during training, DST can outperform the dense neural network by a large margin. We present a series of experiments to support our conjecture and achieve state-of-the-art sparse training performance with ResNet-50 on ImageNet. More impressively, our method achieves dominant performance over overparameterization-based sparse methods at extreme sparsity levels. When trained on CIFAR-100, our method can match the performance of the dense model even at an extreme sparsity level of 98%.
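The prune-and-regrow dynamic the abstract describes can be illustrated with a minimal toy sketch. The code below is not the paper's method; it assumes a single layer's weights represented as index sets, uses random pruning and random regrowth (real DST methods typically prune by weight magnitude), and tracks the union of positions ever activated, i.e. the fraction of the dense parameter space explored across training time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: 100 candidate weights at 90% sparsity -> 10 active at a time.
n_weights, n_active = 100, 10

# Random initial sparse connectivity.
active = set(rng.choice(n_weights, size=n_active, replace=False).tolist())
explored = set(active)  # union of all positions ever activated

# Prune-and-regrow loop (random prune stands in for the usual
# magnitude-based pruning in this toy example).
for step in range(50):
    drop = rng.choice(sorted(active), size=2, replace=False)
    active -= set(drop.tolist())
    candidates = [i for i in range(n_weights) if i not in active]
    grow = rng.choice(candidates, size=2, replace=False)
    active |= set(grow.tolist())
    explored |= active

# Fraction of the dense parameter space visited over training time:
# the quantity that "in-time over-parameterization" refers to.
itop_ratio = len(explored) / n_weights
print(f"active now: {len(active)}, explored ratio: {itop_ratio:.2f}")
```

At every step only 10% of the weights are active, yet after many connectivity updates the explored fraction far exceeds 10%, which is the sense in which exploration over time substitutes for dense over-parameterization in space.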