The ability of deep neural networks to generalise well even when they interpolate their training data has been explained using various "simplicity biases". These theories postulate that neural networks avoid overfitting by first learning simple functions, say a linear classifier, before learning more complex, non-linear functions. Meanwhile, data structure is also recognised as a key ingredient for good generalisation, yet its role in simplicity biases is not yet understood. Here, we show that neural networks trained using stochastic gradient descent initially classify their inputs using lower-order input statistics, like mean and covariance, and exploit higher-order statistics only later during training. We first demonstrate this distributional simplicity bias (DSB) in a solvable model of a neural network trained on synthetic data. We then empirically demonstrate DSB in a range of deep convolutional networks and vision transformers trained on CIFAR10, and show that it even holds in networks pre-trained on ImageNet. We discuss the relation of DSB to other simplicity biases and consider its implications for the principle of Gaussian universality in learning.
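The claim that networks initially rely only on mean and covariance can be probed with "Gaussian clones" of the data: samples of each class are replaced by Gaussian draws matching that class's first two moments, discarding all higher-order statistics. The following is a minimal illustrative sketch of the clone construction (not code from the paper; all names and the toy dataset are hypothetical), verifying that a clone preserves the lower-order statistics of a non-Gaussian class.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_clone(X, rng):
    """Replace samples by Gaussian draws with the same empirical
    mean and covariance, erasing all higher-order structure."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    return rng.multivariate_normal(mu, cov, size=len(X))

# Toy two-class data (illustrative): class 1 is a Gaussian mixture,
# so its higher-order statistics differ from any single Gaussian.
X0 = rng.normal(0.0, 1.0, size=(5000, 2)) + np.array([1.0, 0.0])
X1 = np.concatenate([
    rng.normal(0.0, 0.5, size=(2500, 2)) - np.array([1.0, 0.0]),
    rng.normal(0.0, 1.5, size=(2500, 2)) - np.array([1.0, 0.0]),
])

X1_clone = gaussian_clone(X1, rng)

# The clone matches the mean and covariance of the original class,
# but not its higher moments (e.g. kurtosis of the mixture).
print(np.allclose(X1.mean(axis=0), X1_clone.mean(axis=0), atol=0.1))
print(np.allclose(np.cov(X1.T), np.cov(X1_clone.T), atol=0.1))
```

Under DSB, a network early in training should classify `X1` and `X1_clone` nearly identically, with the gap opening only later as higher-order statistics are exploited.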