We propose a new, more general approach to the design of stochastic gradient-based optimization methods for machine learning. In this framework, optimizers assume access to a batch of gradient estimates per iteration, rather than a single estimate. This better reflects the information that is actually available in typical machine learning setups. To demonstrate the usefulness of this generalized approach, we develop Eve, an adaptation of the Adam optimizer that uses examplewise gradients to obtain more accurate second-moment estimates. We provide preliminary experiments, without hyperparameter tuning, which show that the new optimizer slightly outperforms Adam on a small-scale benchmark and performs the same as or worse than Adam on larger-scale benchmarks. Further work is needed to refine the algorithm and tune hyperparameters.
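To make the batch-of-gradients framing concrete, here is a minimal sketch, assuming an Adam-style update in which the second moment is accumulated from per-example squared gradients rather than from the squared mini-batch mean gradient. The function name `adam_like_step`, its hyperparameter defaults, and this particular second-moment rule are illustrative assumptions, not the paper's exact Eve update.

```python
import numpy as np

def adam_like_step(params, example_grads, state, lr=1e-3,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One optimizer step given a batch of per-example gradients.

    `example_grads` has shape (batch_size, *params.shape). A standard Adam
    step would see only `example_grads.mean(axis=0)`; here the second moment
    is estimated from the individual example gradients (a hypothetical
    reading of the Eve idea, not the paper's exact update rule).
    """
    m, v, t = state
    t += 1

    mean_grad = example_grads.mean(axis=0)       # first-moment input, as in Adam
    sq_grad = (example_grads ** 2).mean(axis=0)  # E[g^2] over the batch, per parameter

    m = beta1 * m + (1 - beta1) * mean_grad
    v = beta2 * v + (1 - beta2) * sq_grad        # plain Adam would use mean_grad ** 2 here

    m_hat = m / (1 - beta1 ** t)                 # bias correction, as in Adam
    v_hat = v / (1 - beta2 ** t)

    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, (m, v, t)


# Usage: the optimizer state is (m, v, t), initialized to zeros and step 0.
params = np.zeros(4)
state = (np.zeros_like(params), np.zeros_like(params), 0)
example_grads = np.random.randn(32, 4)           # batch of 32 per-example gradients
params, state = adam_like_step(params, example_grads, state)
```

The only interface change from standard Adam is that the step consumes the full batch of per-example gradients instead of their mean, which is the extra information the proposed framework assumes optimizers can access.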