Deep neural networks (DNNs) defy the classical bias-variance trade-off: adding parameters to a DNN that interpolates its training data will typically improve its generalization performance. Explaining the mechanism behind this ``benign overfitting'' in deep networks remains an outstanding challenge. Here, we study the last-hidden-layer representations of various state-of-the-art convolutional neural networks and find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information and differ from each other only by statistically independent noise. The number of such groups increases linearly with the width of the layer, but only above a critical width. We show that these redundant neurons appear only once training reaches interpolation, i.e., zero training error.
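The notion of neuron groups that ``carry identical information and differ only by statistically independent noise'' can be illustrated with a small sketch: collect last-hidden-layer activations over a probe set and cluster units whose activations are near-perfectly correlated. This is a minimal illustration under assumptions of our own choosing, not the paper's actual analysis; the function name `find_redundant_groups`, the 0.95 correlation threshold, and the toy data are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def find_redundant_groups(activations, corr_threshold=0.95):
    """Group hidden units whose activations are near-perfectly correlated
    across inputs, as a rough proxy for 'identical information plus
    independent noise'.

    activations: array of shape (n_inputs, n_neurons), e.g. last hidden
    layer outputs collected over a probe set.
    """
    # Pairwise absolute correlation between neurons.
    corr = np.abs(np.corrcoef(activations.T))
    # Turn similarity into a distance for hierarchical clustering.
    dist = 1.0 - corr
    # Condensed (upper-triangular) distance vector for scipy's linkage.
    iu = np.triu_indices_from(dist, k=1)
    link = linkage(dist[iu], method="average")
    # Cut the dendrogram so that units merge only while their average
    # correlation stays above corr_threshold.
    labels = fcluster(link, t=1.0 - corr_threshold, criterion="distance")
    groups = {}
    for neuron, lab in enumerate(labels):
        groups.setdefault(lab, []).append(neuron)
    # Keep only groups with more than one neuron (actual redundancy).
    return [g for g in groups.values() if len(g) > 1]

# Toy demonstration (hypothetical data): 3 underlying signals, each
# duplicated 4 times with independent Gaussian noise, mimicking the
# redundant groups the abstract describes.
rng = np.random.default_rng(0)
signals = rng.normal(size=(1000, 3))
neurons = np.repeat(signals, 4, axis=1) + 0.05 * rng.normal(size=(1000, 12))
print(find_redundant_groups(neurons))  # -> three groups of four neurons
```

In this sketch the group count grows with how many noisy copies of each signal the layer holds, which is one way to picture the abstract's claim that the number of redundant groups grows linearly with layer width once the width exceeds a critical value.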