Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training

We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory. We demonstrate that the magnitudes of the extremal values of the batch Hessian are larger than those of the empirical Hessian. We also derive similar results for the Generalised Gauss-Newton matrix approximation of the Hessian. As a consequence of our theorems, we derive analytical expressions for the maximal learning rates as a function of batch size, informing practical training regimens for both stochastic gradient descent (linear scaling) and adaptive algorithms such as Adam (square root scaling), for smooth, non-convex deep neural networks. Whilst the linear scaling rule for stochastic gradient descent has previously been derived under more restrictive conditions, which we generalise, the square root scaling rule for adaptive optimisers is, to our knowledge, completely novel. For stochastic second-order methods and adaptive methods, we derive that the minimal damping coefficient is proportional to the ratio of the learning rate to batch size. We validate our claims on the VGG/WideResNet architectures on the CIFAR-$100$ and ImageNet datasets. Based on our investigations of the sub-sampled Hessian, we develop an on-the-fly learning rate and momentum learner based on stochastic Lanczos quadrature, which avoids the need for expensive multiple evaluations for these key hyper-parameters and shows good preliminary results on the Pre-Residual Architecture for CIFAR-$100$.
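The two scaling rules named in the abstract can be illustrated with a minimal Python sketch. The abstract states only the functional form (linear in batch size for stochastic gradient descent, square root for adaptive methods such as Adam), so the helper below is an assumption: the function name `scaled_learning_rate`, its parameters, and the reference-point formulation (rescaling a learning rate tuned at one batch size) are hypothetical and are not taken from the paper, whose analytical expressions may contain further terms.

```python
import math


def scaled_learning_rate(base_lr, base_batch, new_batch, rule="linear"):
    """Rescale a learning rate tuned at base_batch for use at new_batch.

    Hypothetical helper: it encodes only the scaling forms named in the
    abstract, not the paper's exact expressions.
      rule="linear": lr grows proportionally to batch size (SGD).
      rule="sqrt":   lr grows with the square root of batch size (Adam-like).
    """
    if base_lr <= 0 or base_batch <= 0 or new_batch <= 0:
        raise ValueError("learning rate and batch sizes must be positive")
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


if __name__ == "__main__":
    # Doubling the batch size doubles the SGD learning rate under linear scaling.
    print(scaled_learning_rate(0.1, 128, 256, rule="linear"))
    # Quadrupling the batch size doubles an Adam learning rate under sqrt scaling.
    print(scaled_learning_rate(0.001, 128, 512, rule="sqrt"))
```

In this sketch the rules are applied relative to a learning rate that is already known to work at a reference batch size. Because the abstract describes maximal learning rates, such a rescaling should be read as an upper guide, not as a guarantee of stable training.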

