Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the rate of decrease of the empirical risk is non-monotone even after averaging over large batches. Long plateaus in which one observes barely any progress alternate with intervals of rapid decrease. These successive phases of learning often take place on very different time scales. Finally, models learnt in an early phase are typically `simpler' or `easier to learn', although in a way that is difficult to formalize. Although theoretical explanations of these phenomena have been put forward, each of them captures at best certain specific regimes. In this paper, we study the gradient flow dynamics of a wide two-layer neural network in high dimension, when data are distributed according to a single-index model (i.e., the target function depends on a one-dimensional projection of the covariates). Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of timescales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.
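To fix notation, one standard formalization of this setting is sketched below. This is written under assumed conventions: the symbols $\varphi$, $w_*$, $\sigma$, $a_j$, $w_j$, $m$, $d$ are illustrative and need not match the paper's own parametrization, and the noise term $\varepsilon$ may be taken to be zero.

% Hedged sketch of the single-index / two-layer setting; illustrative notation only.
\begin{align}
  y &= \varphi(\langle w_*, x\rangle) + \varepsilon, \qquad x \sim \mathsf{N}(0, I_d),\\
  \hat f(x;\theta) &= \frac{1}{m}\sum_{j=1}^{m} a_j\, \sigma(\langle w_j, x\rangle), \qquad
  \theta = (a_j, w_j)_{j\le m},\\
  R(\theta) &= \frac{1}{2}\,\mathbb{E}\big[\big(y - \hat f(x;\theta)\big)^2\big], \qquad
  \frac{\mathrm{d}\theta}{\mathrm{d}t} = -\nabla_\theta R(\theta).
\end{align}

In this notation, the separation of timescales mentioned above corresponds to different blocks of the parameter $\theta$ relaxing at widely different rates under the population flow $\dot\theta = -\nabla_\theta R(\theta)$, which is the structure that lends itself to a singular-perturbation analysis.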