Correctly choosing a learning rate (scheme) for gradient-based optimization is vital in deep learning, since different learning rates can lead to considerable differences in optimization and generalization. As recently found by Lewkowycz et al. \cite{lewkowycz2020large}, there is a large learning rate phase, named the {\it catapult phase}, in which the loss grows at the early stage of training and optimization eventually converges to a flatter minimum with better generalization. While this phenomenon holds for deep neural networks trained with mean squared loss, it is an open question whether logistic (cross-entropy) loss also exhibits a {\it catapult phase} and enjoys better generalization in that phase. This work answers this question by studying deep linear networks with logistic loss. We find that the large learning rate phase is closely related to the separability of the data: non-separable data gives rise to the {\it catapult phase}, and hence a flatter minimum can be reached in this learning rate phase. We demonstrate empirically that this interpretation carries over to real settings on the MNIST and CIFAR10 datasets, where the optimal performance is often found in this large learning rate phase.