Most theoretical studies explaining the regularization effect in deep learning have focused only on gradient descent with a sufficiently small learning rate, or even on gradient flow (infinitesimal learning rate). Such studies, however, neglect the reasonably large learning rates used in most practical applications. In this work, we characterize the implicit bias of deep linear networks for binary classification with the logistic loss in the large learning rate regime, inspired by the seminal work of Lewkowycz et al. [26] in a regression setting with squared loss. They identified a learning rate regime with a large step size, named the catapult phase, in which the loss grows in the early stage of training and eventually converges to a minimum that is flatter than those found in the small learning rate regime. We claim that, depending on the separability of the data, the gradient descent iterates converge to a flatter minimum in the catapult phase. We rigorously prove this claim under an assumption of degenerate data, overcoming the difficulty posed by the non-constant Hessian of the logistic loss, and further characterize the behavior of the loss and the Hessian for non-separable data. Finally, we demonstrate empirically that flatter minima in the space spanned by non-separable data, together with a learning rate in the catapult phase, can lead to better generalization.
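The sketch below is not the paper's experimental setup; it is a minimal illustration of the setting the abstract describes: full-batch gradient descent on a depth-2 linear network with the logistic loss, run once with a small and once with a large step size. The toy data, network width, and the two learning-rate values are assumptions chosen for illustration; the step size at which catapult-like behavior (an early loss spike followed by convergence) appears depends on the data and the initialization.

```python
# Minimal sketch: two-layer linear network, logistic loss, small vs. large learning rate.
# All numerical choices below are illustrative assumptions, not values from the paper.
import numpy as np

# Toy non-separable binary data: two overlapping Gaussian clusters.
data_rng = np.random.default_rng(0)
n, d = 100, 5
X = np.vstack([data_rng.normal(+0.3, 1.0, size=(n // 2, d)),
               data_rng.normal(-0.3, 1.0, size=(n // 2, d))])
y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])

def train(lr, steps=200, width=10, seed=1):
    """Full-batch gradient descent on f(x) = w2^T W1 x with logistic loss.

    Returns the loss trajectory so the small- and large-step-size runs
    can be compared.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(width, d))
    w2 = rng.normal(0.0, 1.0 / np.sqrt(width), size=width)
    losses = []
    for _ in range(steps):
        f = X @ W1.T @ w2                              # network outputs, shape (n,)
        margins = y * f
        losses.append(np.mean(np.logaddexp(0.0, -margins)))  # logistic loss, stable form
        # dL/df_i for L = (1/n) sum_i log(1 + exp(-y_i f_i))
        g = -y / (1.0 + np.exp(np.clip(margins, -30.0, 30.0))) / n
        grad_f = X.T @ g                               # shape (d,)
        grad_W1 = np.outer(w2, grad_f)                 # chain rule through w2^T W1
        grad_w2 = W1 @ grad_f
        W1 -= lr * grad_W1
        w2 -= lr * grad_w2
    return losses

small = train(lr=0.5)    # smooth, monotone decrease
large = train(lr=20.0)   # may spike early before settling (catapult-like behavior)
print("small-lr final loss:", small[-1], "| large-lr final loss:", large[-1])
```

Comparing the two loss trajectories (e.g., by plotting `small` and `large`) is one way to observe, on toy data, the qualitative phenomenon the abstract refers to: the large-step-size run can first increase the loss before converging, whereas the small-step-size run decreases monotonically.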