Recently, convergence and convergence rate analyses of deep learning optimizers for nonconvex optimization have been widely studied. Meanwhile, numerical evaluations of these optimizers have clarified the relationship between batch size and the number of steps needed to train deep neural networks. The main contribution of this paper is to show theoretically that the number of steps needed by each of these optimizers for nonconvex optimization can be expressed as a rational function of batch size. Having these rational functions leads to two particularly important facts, which were validated numerically in previous studies. The first fact is that there exists an optimal batch size such that the number of steps needed for nonconvex optimization is minimized. This implies that using batch sizes larger than the optimal one does not decrease the number of steps needed for nonconvex optimization. The second fact is that the optimal batch size depends on the optimizer. In particular, it is shown theoretically that momentum and Adam-type optimizers can exploit larger optimal batch sizes than the stochastic gradient descent optimizer and thereby further reduce the minimum number of steps needed for nonconvex optimization.
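To make the two facts concrete, the following is a minimal illustrative sketch; the specific rational form and the constants $C_1$, $C_2$, $C_3$ are assumptions chosen for exposition, not the expressions derived in the paper. A step count of the rational form below has a unique minimizer in the batch size $b$:

\[
K(b) \;=\; \frac{C_1}{b} + C_2\, b + C_3
\;=\; \frac{C_2\, b^{2} + C_3\, b + C_1}{b},
\qquad
b^{\star} \;=\; \sqrt{\frac{C_1}{C_2}},
\qquad
K(b^{\star}) \;=\; 2\sqrt{C_1 C_2} + C_3 .
\]

For $b > b^{\star}$ the step count does not decrease, which mirrors the first fact; and since the constants would depend on the optimizer (for example, on momentum or Adam-type hyperparameters), the minimizer $b^{\star}$ and the minimum value $K(b^{\star})$ differ across optimizers, which mirrors the second fact.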