The benefits of over-parameterization in achieving superior generalization performance have been shown in several recent studies, justifying the trend of using larger models in practice. In the context of robust learning, however, the effect of neural network size has not been well studied. In this work, we find that in the presence of a substantial fraction of mislabeled examples, increasing the network size beyond some point can be harmful. In particular, the originally monotonic or `double descent' test loss curve (w.r.t. network width) turns into a U-shaped or a double U-shaped curve when label noise increases, suggesting that the best generalization is achieved by a model of intermediate size. We observe that when network size is instead controlled by density through random pruning, the test loss behaves similarly. We also take a closer look at both phenomena through bias-variance decomposition and theoretically characterize how label noise shapes the variance term. Similar behavior of the test loss can be observed even when state-of-the-art robust methods are applied, indicating that limiting the network size could further boost existing methods. Finally, we empirically examine the effect of network size on the smoothness of learned functions, and find that the originally negative correlation between size and smoothness is flipped by label noise.