Adaptive gradient methods such as RMSProp and Adam use an exponential moving average of the squared gradient to compute adaptive step sizes, achieving better convergence than SGD in the face of noisy objectives. However, Adam can exhibit undesirable convergence behaviors due to unstable or extreme adaptive learning rates. Methods such as AMSGrad and AdaBound have been proposed to stabilize the adaptive learning rates of Adam in the later stage of training, but they do not outperform Adam on some practical tasks such as training Transformers \cite{transformer}. In this paper, we propose an adaptive learning rate principle in which the running mean of the squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate. This results in faster adaptation to the local gradient variance, which leads to more desirable empirical convergence behaviors than Adam. We prove that the proposed algorithm converges under mild assumptions for nonconvex stochastic optimization problems, and demonstrate the improved efficacy of our adaptive averaging approach on machine translation, natural language understanding, and large-batch pretraining of BERT. The code is available at https://github.com/zhuchen03/MaxVA.
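As a rough illustration of the principle (a sketch only: the precise variance estimator, the admissible range of the mixing weight, and any bias corrections are given in the paper body, not in this abstract), Adam's fixed second-moment decay $\beta_2$ in $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ is replaced, coordinate-wise, by a step-dependent weight chosen to maximize the estimated variance of the gradient:
\begin{align*}
\hat{m}_t(\beta) &= \beta\,\hat{m}_{t-1} + (1-\beta)\,g_t, \qquad
\hat{v}_t(\beta) = \beta\,\hat{v}_{t-1} + (1-\beta)\,g_t^2, \\
\beta_t &= \argmax_{\beta \in [\underline{\beta},\,\overline{\beta}]} \bigl(\hat{v}_t(\beta) - \hat{m}_t(\beta)^2\bigr),
\end{align*}
where $[\underline{\beta},\,\overline{\beta}]$ denotes an assumed interval keeping the weight valid, and the resulting $\hat{v}_t(\beta_t)$ takes the place of the running mean of squared gradients in the adaptive step size.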