Gradient-based learning in multi-layer neural networks displays a number of striking features. In particular, the rate of decrease of the empirical risk is non-monotone, even after averaging over large batches. Long plateaus in which one observes barely any progress alternate with intervals of rapid decrease. These successive phases of learning often take place on very different time scales. Finally, models learnt in an early phase are typically `simpler' or `easier to learn', although in a way that is difficult to formalize. Although theoretical explanations of these phenomena have been put forward, each of them captures at best certain specific regimes. In this paper, we study the gradient flow dynamics of a wide two-layer neural network in high dimension, when data are distributed according to a single-index model (i.e., the target function depends on a one-dimensional projection of the covariates). Based on a mixture of new rigorous results, non-rigorous mathematical derivations, and numerical simulations, we propose a scenario for the learning dynamics in this setting. In particular, the proposed evolution exhibits separation of timescales and intermittency. These behaviors arise naturally because the population gradient flow can be recast as a singularly perturbed dynamical system.
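To make the setting concrete, the sketch below (illustrative only, not the paper's experimental setup) trains a wide two-layer ReLU network by full-batch gradient descent on synthetic data from a single-index model y = phi(<w_star, x>). The dimensions, step size, and link function phi are hypothetical choices; for suitable parameters the printed risk trajectory exhibits an initial plateau followed by a faster decrease, mirroring the timescale separation discussed above.

```python
import numpy as np

# Illustrative sketch: full-batch gradient descent on a wide two-layer ReLU
# network fit to a single-index target y = phi(<w_star, x>). All constants
# (d, m, n, lr, the link phi) are hypothetical, not taken from the paper.
rng = np.random.default_rng(0)
d, m, n, lr, steps = 50, 256, 4000, 0.2, 2000

w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)          # unit-norm index direction
phi = lambda t: t + t**2                  # example link function

X = rng.standard_normal((n, d))           # Gaussian covariates
y = phi(X @ w_star)                       # single-index labels

# Mean-field parametrization: f(x) = (1/m) * a^T relu(W x).
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.standard_normal(m)

for t in range(steps):
    Z = X @ W.T                           # pre-activations, shape (n, m)
    H = np.maximum(Z, 0.0)                # ReLU features
    pred = H @ a / m
    err = pred - y
    risk = 0.5 * np.mean(err**2)
    if t % 200 == 0:
        print(f"step {t:5d}   empirical risk {risk:.4f}")
    # Gradients of the empirical risk; the extra factor m in the update
    # compensates the 1/m output scaling (mean-field time parametrization).
    grad_a = H.T @ err / (n * m)
    grad_W = ((err[:, None] * (Z > 0) * a[None, :] / m).T @ X) / n
    a -= lr * m * grad_a
    W -= lr * m * grad_W
```

Plotting the recorded risk against the iteration count (or its logarithm) makes the alternation of plateaus and rapid drops easier to see than the printed values alone.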