Understanding how feature learning affects generalization is among the foremost goals of modern deep learning theory. Here, we study how the ability to learn representations affects the generalization performance of a simple class of models: deep Bayesian linear neural networks trained on unstructured Gaussian data. By comparing deep random feature models to deep networks in which all layers are trained, we provide a detailed characterization of the interplay between width, depth, data density, and prior mismatch. We show that both models display sample-wise double descent in the presence of label noise. Random feature models can also display model-wise double descent if there are narrow bottleneck layers, while deep networks do not show these divergences. Random feature models can have particular widths that are optimal for generalization at a given data density, while making neural networks as wide or as narrow as possible is always optimal. Moreover, we show that the leading-order correction to the kernel-limit learning curve cannot distinguish between random feature models and deep networks in which all layers are trained. Taken together, our findings begin to elucidate how architectural details affect generalization performance in this simple class of deep regression models.