It has been observed in practice that applying pruning-at-initialization methods to neural networks and then training the sparsified networks can not only retain the test performance of the original dense models, but sometimes even slightly improve generalization. A theoretical understanding of these experimental observations is yet to be developed. This work makes a first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization. Specifically, this work considers a classification task for overparameterized two-layer neural networks, where the network is randomly pruned at initialization according to different pruning fractions. It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero and the network exhibits good generalization performance. More surprisingly, the generalization bound improves as the pruning fraction grows. To complement this positive result, this work further shows a negative result: there exists a large pruning fraction such that, while gradient descent is still able to drive the training loss toward zero (by memorizing noise), the generalization performance is no better than random guessing. This further suggests that pruning can change the feature learning process, which leads to the performance drop of the pruned neural network.
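The pruning-at-initialization setup described above can be sketched in a few lines: a random binary mask zeroes out a chosen fraction of the first-layer weights before training, and that mask stays fixed for the rest of training. This is a minimal illustrative sketch (function and variable names are ours, not the paper's), not the paper's exact construction:

```python
import numpy as np

def prune_at_init(weights, fraction, seed=0):
    """Randomly zero out `fraction` of the entries of `weights` at
    initialization; the returned mask is kept fixed during training."""
    rng = np.random.default_rng(seed)
    # Each weight is kept independently with probability 1 - fraction.
    mask = rng.random(weights.shape) >= fraction
    return weights * mask, mask

# Hypothetical two-layer-network sizes: input dim d, hidden width m.
d, m = 20, 100
rng = np.random.default_rng(42)
W = rng.standard_normal((m, d)) / np.sqrt(d)  # dense first-layer init

# Prune half the weights at initialization.
W_pruned, mask = prune_at_init(W, fraction=0.5)

# Surviving weights are unchanged; pruned ones are exactly zero.
empirical_fraction = 1.0 - mask.mean()
```

During the subsequent gradient descent phase, one would apply the same `mask` to every gradient update so the pruned weights remain zero.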