A key challenge in building theoretical foundations for deep learning is the complex optimization dynamics of neural networks, which arise from high-dimensional interactions among the large number of network parameters. Such non-trivial dynamics lead to intriguing behaviors such as the phenomenon of "double descent" of the generalization error. The more commonly studied aspect of this phenomenon is model-wise double descent, where the test error exhibits a second descent as model complexity increases, beyond the classical U-shaped error curve. In this work, we investigate the origins of the less studied epoch-wise double descent, in which the test error undergoes two non-monotonic transitions, or descents, as training time increases. By leveraging tools from statistical physics, we study a linear teacher-student setup exhibiting epoch-wise double descent similar to that in deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of the generalization error over training. We find that double descent can be attributed to distinct features being learned at different scales: as fast-learning features overfit, slower-learning features start to fit, resulting in a second descent in test error. We validate our findings through numerical experiments in which our theory accurately predicts empirical results and remains consistent with observations in deep neural networks.
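To make the described mechanism concrete, below is a minimal numerical sketch, not the paper's actual setup or its closed-form expressions: a linear teacher-student regression in which one group of input features is scaled down (and thus learned roughly 1/s^2 times more slowly by gradient descent), with the slow teacher weights boosted so both groups carry comparable signal. All dimensions, scales, the noise level, and the 1/s rescaling are hypothetical choices for illustration; whether a clear second descent appears depends on these settings.

```python
# Illustrative sketch of epoch-wise double descent in a linear teacher-student
# setup. All parameter choices below are hypothetical, chosen only so that the
# fast features overfit label noise before the slow features are learned.
import numpy as np

rng = np.random.default_rng(0)

# Two feature groups: "fast" features with unit input scale and "slow"
# features whose inputs are scaled by s, so gradient descent learns them
# roughly 1/s^2 times more slowly.
n, d_fast, d_slow = 100, 30, 30
s = 0.1        # input scale of the slow features
sigma = 1.0    # label-noise standard deviation

# Teacher weights; the slow-feature weights are boosted by 1/s so that both
# groups contribute comparably to the labels despite their different scales.
w_fast = rng.normal(0.0, 1.0, d_fast) / np.sqrt(d_fast)
w_slow = rng.normal(0.0, 1.0, d_slow) / (s * np.sqrt(d_slow))
w_star = np.concatenate([w_fast, w_slow])

def sample(n_samples):
    x = np.concatenate([rng.normal(0.0, 1.0, (n_samples, d_fast)),
                        rng.normal(0.0, s, (n_samples, d_slow))], axis=1)
    y = x @ w_star + sigma * rng.normal(0.0, 1.0, n_samples)
    return x, y

X_train, y_train = sample(n)
X_test, y_test = sample(10_000)

# Full-batch gradient descent on the squared loss from zero initialization,
# logging the test error at (roughly) log-spaced training steps.
w = np.zeros(d_fast + d_slow)
lr, steps = 0.3, 50_000
checkpoints = {int(t) for t in np.logspace(0, np.log10(steps), 40)}
history = []
for t in range(1, steps + 1):
    grad = X_train.T @ (X_train @ w - y_train) / n
    w -= lr * grad
    if t in checkpoints:
        history.append((t, float(np.mean((X_test @ w - y_test) ** 2))))

for t, err in history:
    print(f"step {t:>6d}  test MSE {err:.3f}")
# Expected qualitative shape: an early descent as the fast features are
# learned, a rise as they begin fitting the label noise, and a second descent
# once the slow but informative features are finally learned.
```

Because the dynamics are (approximately) separable across the two feature groups, the test-error curve is roughly a sum of two U-shaped curves on well-separated time scales, which is the multi-scale mechanism the abstract refers to.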