Training a neural network requires choosing a suitable learning rate, which involves a trade-off between the speed and effectiveness of convergence. While there has been considerable theoretical and empirical analysis of how large the learning rate can be, most prior work focuses only on late-stage training. In this work, we introduce the maximal initial learning rate $\eta^{\ast}$, defined as the largest learning rate at which a randomly initialized neural network can successfully begin training and achieve (at least) a given threshold accuracy. Using a simple approach to estimate $\eta^{\ast}$, we observe that in constant-width fully-connected ReLU networks, $\eta^{\ast}$ behaves differently from the maximum learning rate later in training. Specifically, we find that $\eta^{\ast}$ is well predicted as a power of $(\text{depth} \times \text{width})$, provided that (i) the width of the network is sufficiently large compared to the depth, and (ii) the input layer of the network is trained at a relatively small learning rate. We further analyze the relationship between $\eta^{\ast}$ and the sharpness $\lambda_{1}$ of the network at initialization, indicating that they are closely though not inversely related. We formally prove bounds for $\lambda_{1}$ in terms of $(\text{depth} \times \text{width})$ that align with our empirical results.
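To make the definition of $\eta^{\ast}$ concrete, the following Python sketch estimates it by a log-scale bisection over trial learning rates: train a freshly initialized constant-width fully-connected ReLU network at each trial rate and check whether it reaches a threshold accuracy. This is an illustrative assumption, not the estimation procedure of the paper (which is not specified in this abstract); the synthetic data, network sizes, SGD optimizer, step budget, and 90% threshold are all placeholder choices.

import math
import torch
import torch.nn as nn

def make_net(depth, width, in_dim=20, n_classes=2):
    # Constant-width fully-connected ReLU network (illustrative sizes).
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, n_classes))
    return nn.Sequential(*layers)

def reaches_threshold(lr, depth=5, width=128, steps=200, threshold=0.9, seed=0):
    # Train from a fixed random init at learning rate `lr`; return True if the
    # network attains the threshold accuracy on a toy synthetic task.
    torch.manual_seed(seed)
    x = torch.randn(1024, 20)
    y = (x[:, 0] > 0).long()          # stand-in for a real dataset
    net = make_net(depth, width)
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(net(x), y)
        if not torch.isfinite(loss):  # training has diverged at this rate
            return False
        loss.backward()
        opt.step()
    with torch.no_grad():
        acc = (net(x).argmax(dim=1) == y).float().mean().item()
    return acc >= threshold

def estimate_eta_star(lo=1e-4, hi=10.0, iters=12):
    # Bisect on a log scale between a rate assumed to train (lo) and one
    # assumed to fail (hi); `lo` after the loop approximates eta*.
    for _ in range(iters):
        mid = math.exp(0.5 * (math.log(lo) + math.log(hi)))
        if reaches_threshold(mid):
            lo = mid                  # mid still trains, so eta* >= mid
        else:
            hi = mid                  # mid fails, so eta* < mid
    return lo

if __name__ == "__main__":
    print(f"estimated eta* ~ {estimate_eta_star():.4f}")

Under the (assumed) monotonicity of trainability in the learning rate, the bracket [lo, hi] shrinks geometrically, so a dozen bisection rounds locate $\eta^{\ast}$ to within a few percent on a log scale.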