We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory. We demonstrate that the magnitudes of the extremal values of the batch Hessian are larger than those of the empirical Hessian. We derive similar results for the generalised Gauss-Newton matrix approximation of the Hessian. As a consequence of our theorems, we derive analytical expressions for the maximal learning rate as a function of batch size, informing practical training regimens for both stochastic gradient descent (linear scaling) and adaptive algorithms, such as Adam (square-root scaling), on smooth, non-convex deep neural networks. Whilst the linear scaling rule for stochastic gradient descent has been derived under more restrictive conditions, which we generalise, the square-root scaling rule for adaptive optimisers is, to our knowledge, completely novel.
%For stochastic second-order methods and adaptive methods, we derive that the minimal damping coefficient is proportional to the ratio of the learning rate to the batch size.
We validate our claims on the VGG and WideResNet architectures on the CIFAR-$100$ and ImageNet datasets. Based on our investigations of the sub-sampled Hessian, we develop a stochastic Lanczos quadrature based, on-the-fly learning rate and momentum learner, which avoids the need for expensive multiple evaluations of these key hyper-parameters and shows good preliminary results on the Pre-Residual architecture for CIFAR-$100$.
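As a schematic summary of the scaling rules stated above (the symbols $\epsilon^{*}$ for the maximal stable learning rate and $B$ for the batch size are notational assumptions for this sketch, not definitions taken from the main text), the claimed dependence on batch size can be written as
\[
\epsilon^{*}_{\mathrm{SGD}}(B) \;\propto\; B,
\qquad
\epsilon^{*}_{\mathrm{Adam}}(B) \;\propto\; \sqrt{B},
\]
so that, for example, doubling the batch size would permit doubling the SGD learning rate, but only a $\sqrt{2}$-fold increase for Adam.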