Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, which involve second derivatives and/or second-order statistics of the data, are far less prevalent despite their strong theoretical properties, due to prohibitive computation, memory, and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad) which, together with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements over conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance over the state of the art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.
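To make the underlying idea concrete, the sketch below is a minimal, non-distributed illustration of full-matrix Adagrad preconditioning, the method family the abstract refers to; it is not the paper's scalable implementation, and the function name full_matrix_adagrad_step and its hyperparameters are hypothetical choices for illustration.

```python
import numpy as np

def full_matrix_adagrad_step(w, g, H, lr=0.1, eps=1e-6):
    """One illustrative full-matrix Adagrad step on a flat parameter vector.

    w : current parameters, shape (d,)
    g : gradient at w, shape (d,)
    H : running sum of gradient outer products, shape (d, d)
    """
    # Accumulate second-order statistics of the gradients.
    H = H + np.outer(g, g)
    # Preconditioner is the inverse matrix square root of (H + eps*I),
    # computed here naively via an eigendecomposition.
    vals, vecs = np.linalg.eigh(H + eps * np.eye(len(g)))
    precond = vecs @ np.diag(vals ** -0.5) @ vecs.T
    # Preconditioned gradient update.
    w = w - lr * precond @ g
    return w, H
```

The O(d^2) memory for H and the O(d^3) cost of the matrix root in this naive form are exactly the computation, memory, and communication burdens the abstract refers to, which the paper's implementation addresses with algorithmic and numerical improvements on heterogeneous CPU/accelerator hardware.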