Training deep neural networks with a large batch size has shown promising results and benefits many real-world applications. However, the optimizer converges slowly in early epochs, and there is a gap between large-batch deep learning optimization heuristics and their theoretical underpinnings. In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training. We also analyze the convergence rate of the proposed method by introducing a new fine-grained analysis of gradient-based methods. Based on this analysis, we bridge the gap and provide theoretical insights into three popular large-batch training techniques: linear learning rate scaling, gradual warmup, and layer-wise adaptive rate scaling. Extensive experiments demonstrate that the proposed algorithm outperforms the gradual warmup technique by a large margin and converges faster than the state-of-the-art large-batch optimizer when training advanced deep neural networks (ResNet, DenseNet, MobileNet) on the ImageNet dataset.
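To make the layer-wise adaptive rate scaling technique mentioned above concrete, the following is a minimal sketch of a generic LARS-style update, in which each layer's learning rate is rescaled by a per-layer trust ratio. It is an illustrative assumption of the standard technique only, not the CLARS algorithm proposed in this paper; the function name, the coefficient `eta`, and the specific update rule are hypothetical choices for the example.

```python
# Minimal sketch of a layer-wise adaptive rate scaling (LARS-style) step.
# This illustrates the generic technique referenced in the abstract; it is
# NOT the proposed CLARS algorithm. Hyperparameters below are assumptions.
import numpy as np

def layerwise_adaptive_step(weights, grads, base_lr=0.1, eta=0.001, weight_decay=1e-4):
    """One SGD step where each layer's learning rate is rescaled by the
    ratio of its weight norm to its (regularized) gradient norm."""
    new_weights = []
    for w, g in zip(weights, grads):
        g = g + weight_decay * w                      # add L2 regularization to the gradient
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        # Per-layer trust ratio scales the global learning rate.
        trust_ratio = eta * w_norm / g_norm if w_norm > 0 and g_norm > 0 else 1.0
        new_weights.append(w - base_lr * trust_ratio * g)
    return new_weights

# Usage on two toy "layers" with random weights and gradients.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 4)), rng.normal(size=(4,))]
grads = [rng.normal(size=(4, 4)), rng.normal(size=(4,))]
weights = layerwise_adaptive_step(weights, grads)
```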