The generalization mystery in deep learning is the following: Why do over-parameterized neural networks trained with gradient descent (GD) generalize well on real datasets even though they are capable of fitting random datasets of comparable size? Furthermore, from among all solutions that fit the training data, how does GD find one that generalizes well (when such a well-generalizing solution exists)? We argue that the answer to both questions lies in the interaction of the gradients of different examples during training. Intuitively, if the per-example gradients are well-aligned, that is, if they are coherent, then one may expect GD to be (algorithmically) stable, and hence generalize well. We formalize this argument with an easy-to-compute and interpretable metric for coherence, and show that the metric takes on very different values on real and random datasets for several common vision networks. The theory also explains a number of other phenomena in deep learning, such as why some examples are reliably learned earlier than others, why early stopping works, and why it is possible to learn from noisy labels. Moreover, since the theory provides a causal explanation of how GD finds a well-generalizing solution when one exists, it motivates a class of simple modifications to GD that attenuate memorization and improve generalization. Generalization in deep learning is an extremely broad phenomenon, and therefore, it requires an equally general explanation. We conclude with a survey of alternative lines of attack on this problem, and argue on this basis that the proposed approach is the most viable one.
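To make the notion of per-example gradient alignment concrete, here is a minimal sketch of how one might measure the coherence of a set of per-example gradients. It assumes a coherence measure of the form alpha = ||mean_i g_i||^2 / mean_i ||g_i||^2, which is close to 1 when gradients align and close to 1/m when the m gradients are mutually orthogonal; this is one natural instantiation of the idea described above, and the toy model and random data are placeholders, not the paper's experimental setup.

```python
import torch
import torch.nn as nn

# Hypothetical toy setup, for illustration only; the paper's experiments
# use real vision networks and datasets.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(64, 20)
y = torch.randint(0, 2, (64,))

def per_example_grads(model, x, y):
    """Flattened gradient of the loss w.r.t. all parameters, one per example."""
    grads = []
    for xi, yi in zip(x, y):
        model.zero_grad()
        loss = loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0))
        loss.backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()])
        grads.append(g)
    return torch.stack(grads)  # shape: (num_examples, num_params)

def coherence(g):
    """alpha = ||mean_i g_i||^2 / mean_i ||g_i||^2 (an assumed form of the
    metric): ~1 for well-aligned gradients, ~1/m for orthogonal ones."""
    num = g.mean(dim=0).pow(2).sum()
    den = g.pow(2).sum(dim=1).mean()
    return (num / den).item()

g = per_example_grads(model, x, y)
print(f"coherence = {coherence(g):.4f}")
```

On the view sketched in the abstract, one would expect such a quantity to be markedly higher when the labels carry real structure than when they are randomized, since random labels leave little for the per-example gradients to agree on.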