Mini-batch stochastic gradient descent (SGD) and its variants approximate the objective function's gradient using a small number of training examples, known as the batch size. Small batch sizes require little computation per model update but can yield high-variance gradient estimates, which complicates optimization. Conversely, large batch sizes require more computation but yield more precise gradient estimates. This work presents a method that adapts the batch size to the model's training loss. For several function classes, we show that our method requires the same order of model updates as gradient descent while requiring the same order of gradient computations as SGD. This method requires evaluating the model's loss on the entire dataset at every model update; however, the required computation is greatly reduced by a passive approximation of the adaptive method. We provide extensive experiments illustrating that our methods require fewer model updates without increasing the total amount of computation.
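The sketch below is a minimal illustration of the general idea stated above: grow the mini-batch size as the training loss shrinks, so that later updates use lower-variance gradient estimates. The specific proportionality rule (batch size scaled by the inverse of the current loss), the toy least-squares problem, and all names such as `full_loss` and `minibatch_grad` are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: minimize f(w) = (1/2n) * ||X w - y||^2.
n, d = 1000, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def full_loss(w):
    """Loss over the entire dataset (the quantity the adaptive rule monitors)."""
    r = X @ w - y
    return 0.5 * np.mean(r ** 2)

def minibatch_grad(w, batch_size):
    """Unbiased gradient estimate from a uniformly sampled mini-batch."""
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

w = np.zeros(d)
lr = 0.05
b0 = 8                      # initial batch size (assumed)
loss0 = full_loss(w)        # reference loss for the adaptation rule (assumed)

for step in range(200):
    # "Active" variant from the abstract: evaluate the full-data loss each update.
    loss = full_loss(w)
    # Assumed adaptation rule for this sketch: batch size inversely
    # proportional to the current loss, capped at the full dataset.
    batch_size = min(n, max(b0, int(b0 * loss0 / max(loss, 1e-12))))
    w -= lr * minibatch_grad(w, batch_size)

print(f"final loss: {full_loss(w):.4f}, final batch size: {batch_size}")
```

A passive approximation in the spirit of the abstract might replace the per-update call to `full_loss` with a cheap running estimate built from the mini-batch losses already computed, avoiding a full pass over the data at every update; the exact construction used in the paper is not specified here.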