It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully-connected networks trained for the same task on the same data. Here we propose an explanation for this puzzle, by showing that feature learning can perform worse than lazy training (via the random feature kernel or the NTK), as the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark datasets of images. For (i), we compute the scaling of the generalization error with the number of training points, and show that methods that do not learn features generalize better, even when the dimension of the input space is large. For (ii), we show empirically that learning features can indeed lead to sparse and thereby less smooth representations of the image predictors. This fact is plausibly responsible for the deterioration in performance, since performance is known to correlate with smoothness along diffeomorphisms.
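The lazy-training baseline of setting (i) can be sketched as follows: fit a random-feature ridge regression (features fixed at initialization, only the linear readout trained) to a target on the unit sphere that varies along a single input direction. This is a minimal illustrative sketch, not the paper's exact experimental setup; the target function, feature count, and ridge parameter below are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, n_features = 5, 200, 100, 2000

def sample_sphere(n, d):
    # Uniform points on the (d-1)-sphere via normalized Gaussians
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def target(x):
    # Illustrative target: constant along all directions but the first
    return np.cos(3.0 * x[:, 0])

X_tr, X_te = sample_sphere(n_train, d), sample_sphere(n_test, d)
y_tr, y_te = target(X_tr), target(X_te)

# Lazy regime: random ReLU features are frozen; only the readout is fit
W = rng.standard_normal((n_features, d))
phi = lambda X: np.maximum(X @ W.T, 0.0) / np.sqrt(n_features)

ridge = 1e-6
F = phi(X_tr)
a = np.linalg.solve(F.T @ F + ridge * np.eye(n_features), F.T @ y_tr)

test_err = np.mean((phi(X_te) @ a - y_te) ** 2)
```

Comparing `test_err` against the same budget of trained (feature-learning) neurons, as a function of `n_train`, is the kind of scaling experiment the abstract refers to.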