The learning rate schedule can significantly affect generalization performance in modern neural networks, but the reasons for this are not yet understood. Li-Wei-Ma (2019) recently proved that this behavior can arise in a simplified non-convex neural-network setting. In this note, we show that this phenomenon can occur even for convex learning problems -- in particular, for linear regression in 2 dimensions. We give a toy convex problem where learning rate annealing (a large initial learning rate, followed by a small learning rate) provably leads gradient descent to minima with better generalization than those reached using a small learning rate throughout. In our case, this occurs due to a combination of a mismatch between the test and train loss landscapes and early stopping.
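To make the schedule comparison concrete, the following is a minimal numerical sketch of the kind of effect described above, not the construction analyzed in this note: the Hessians `H_train` and `H_test`, the step sizes, and the iteration budget are all illustrative assumptions. It only shows how, under a fixed early-stopping budget, a large-then-small schedule can end at a point with lower test loss than a constant small learning rate when the train and test quadratics weight directions differently.

```python
import numpy as np

# Toy 2D least-squares setup (illustrative numbers, not the note's construction).
# A quadratic 0.5 * (w - w_star)^T H (w - w_star) is the least-squares loss for
# 2D linear regression with input covariance H; using different H for train and
# test models a mismatch between the train and test loss landscapes.
H_train = np.diag([0.1, 1.0])    # train loss barely penalizes the first coordinate
H_test = np.eye(2)               # test loss weights both coordinates equally
w_star = np.zeros(2)             # shared minimizer of both losses
w_init = np.array([10.0, 10.0])

def train_grad(w):
    """Gradient of the train loss at w."""
    return H_train @ (w - w_star)

def test_loss(w):
    """Test loss at w."""
    d = w - w_star
    return 0.5 * d @ H_test @ d

def run_gd(lrs, w):
    """Full-batch gradient descent on the train loss with per-step learning rates."""
    for lr in lrs:
        w = w - lr * train_grad(w)
    return w

budget = 30  # early stopping: both schedules get the same iteration budget

# Schedule 1: small learning rate throughout (stable, but slow along the
# low-curvature train direction, which the test loss cares about).
small = [0.5] * budget

# Schedule 2: annealed -- a large rate first (fast progress along the
# low-curvature direction, temporary growth along the high-curvature one),
# then a small rate that corrects the high-curvature direction.
annealed = [2.2] * 15 + [0.5] * 15

w_small = run_gd(small, w_init.copy())
w_anneal = run_gd(annealed, w_init.copy())
print(f"test loss, small LR throughout: {test_loss(w_small):.4f}")
print(f"test loss, annealed LR:         {test_loss(w_anneal):.4f}")
```

With these illustrative numbers, the annealed run ends with a much lower test loss than the constant small rate. The mismatch matters because the train loss barely penalizes the slow direction, so the constant-small-rate run can reach a small train loss at the early-stopping point while its test loss remains large.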