It has been observed in practice that applying pruning-at-initialization methods to neural networks and training the resulting sparse networks can not only retain the test performance of the original dense models, but sometimes even slightly improve generalization. A theoretical understanding of such experimental observations is yet to be developed. This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization. Specifically, this work considers a classification task for overparameterized two-layer neural networks, where the network is randomly pruned at initialization according to different pruning fractions. It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero and the network exhibits good generalization performance. More surprisingly, the generalization bound improves as the pruning fraction grows. To complement this positive result, this work further shows a negative result: there exists a large pruning fraction such that, while gradient descent is still able to drive the training loss toward zero (by memorizing noise), the generalization performance is no better than random guessing. This further suggests that pruning can change the feature learning process, which leads to the performance drop of the pruned neural network. To the best of our knowledge, this is the \textbf{first} generalization result for pruned neural networks, suggesting that pruning can improve a neural network's generalization.
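As a rough illustration of the setting described above (and not the paper's exact construction), the following NumPy sketch randomly prunes the first layer of a two-layer ReLU network at initialization with a given pruning fraction and then trains only the surviving weights with full-batch gradient descent on the logistic loss. The data model, dimensions, and hyperparameters are illustrative assumptions.

\begin{verbatim}
import numpy as np

# Minimal sketch of pruning-at-initialization for a two-layer ReLU network.
# All problem sizes and the toy data model below are assumptions for illustration.
rng = np.random.default_rng(0)

d, m, n = 20, 512, 200            # input dim, hidden width, sample size
prune_frac = 0.5                  # fraction of first-layer weights removed at init

X = rng.standard_normal((n, d))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(n))   # toy binary labels

W = rng.standard_normal((m, d)) / np.sqrt(d)          # first-layer initialization
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)      # fixed second layer
mask = (rng.random((m, d)) > prune_frac).astype(float)  # random pruning mask
W *= mask                                             # sparsify before training

def forward(W):
    H = np.maximum(X @ W.T, 0.0)   # ReLU features, shape (n, m)
    return H @ a                   # network outputs f(x_i)

lr = 0.5
for step in range(500):
    out = forward(W)
    p = 1.0 / (1.0 + np.exp(y * out))   # sigmoid(-y f): logistic-loss derivative term
    # Gradient of (1/n) sum_i log(1 + exp(-y_i f(x_i))) w.r.t. W
    G = ((-(y * p)[:, None] * (X @ W.T > 0)) * a[None, :]).T @ X / n
    W -= lr * (G * mask)                # update only the surviving weights

train_acc = np.mean(np.sign(forward(W)) == y)
print(f"pruning fraction {prune_frac:.1f}, training accuracy {train_acc:.2f}")
\end{verbatim}

Varying \texttt{prune\_frac} in this sketch mirrors the experiment the abstract refers to: moderate pruning fractions still allow the training loss to be driven toward zero, while very aggressive pruning changes what the surviving weights can learn.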