具有动脉的斯托切斯梯度下层的超分光计上 (On the Hyperparameters in Stochastic Gradient Descent with Momentum) - 专知论文

会员服务 ·

0

动量 · SGD · 随机梯度下降 · Continuity · 线性的 ·

2021 年 8 月 9 日

On the Hyperparameters in Stochastic Gradient Descent with Momentum

翻译：具有动脉的斯托切斯梯度下层的超分光计上

Following the same routine as [SSJ20], we continue to present the theoretical analysis for stochastic gradient descent with momentum (SGD with momentum) in this paper. Differently, for SGD with momentum, we demonstrate it is the two hyperparameters together, the learning rate and the momentum coefficient, that play the significant role for the linear rate of convergence in non-convex optimization. Our analysis is based on the use of a hyperparameters-dependent stochastic differential equation (hp-dependent SDE) that serves as a continuous surrogate for SGD with momentum. Similarly, we establish the linear convergence for the continuous-time formulation of SGD with momentum and obtain an explicit expression for the optimal linear rate by analyzing the spectrum of the Kramers-Fokker-Planck operator. By comparison, we demonstrate how the optimal linear rate of convergence and the final gap for SGD only about the learning rate varies with the momentum coefficient increasing from zero to one when the momentum is introduced. Then, we propose a mathematical interpretation why the SGD with momentum converges faster and more robust about the learning rate than the standard SGD in practice. Finally, we show the Nesterov momentum under the existence of noise has no essential difference with the standard momentum.

翻译：遵循与[SSJ20]相同的常规,我们继续在本文件中介绍关于随机梯度下降的理论分析,并有动力(SGD,有动力)的理论分析。不同的是,对于有动力的SGD而言,我们展示的是两个超参数,即学习率和动力系数,这在非康韦克斯优化中的线性趋同率方面起着重要作用。我们的分析以使用依赖超光谱的随机梯度差分方程(hp-依赖SDE)为基础,该方程作为SGD的连续代谢器。同样,我们通过分析克拉默斯-福克-普朗克操作员的频谱,为SGD的连续时间配制建立线性趋同,并获得最佳线性速度的明确表达。相比之下,我们展示了SGD的最佳线性趋同率和最后差距是如何在引入动力系数时从0到1的。然后,我们提出数学解释,为什么SGD与动力的趋同速度比标准的SGD差异更迅速、更牢固地接近。最后,我们展示了标准的SGD的形成势头。

0

相关内容

动量方法 (Polyak, 1964) 旨在加速学习，特别是处理高曲率、小但一致的梯度，或是带噪声的梯度。动量算法积累了之前梯度指数级衰减的移动平均，并且继续沿该方向移动。

INRIA最新「机器学习理论」新书，229页pdf原理性阐述机器学习

INRIA最新「机器学习理论」新书，229页pdf原理性阐述机器学习

专知会员服务

69+阅读 · 2021年3月27日

INRIA 最新《机器学习理论》课程笔记，176页pdf

专知会员服务

51+阅读 · 2020年12月14日

【Google】深度学习对抗鲁棒性，43页ppt

专知会员服务

45+阅读 · 2020年10月31日

不可错过！UIUC最新《统计强化学习》课程！

专知会员服务

53+阅读 · 2020年9月7日

【ICML2020】噪声在随机梯度下降中的泛化效益，On the Generalization Benefit of Noise in Stochastic Gradient Descent

【ICML2020】噪声在随机梯度下降中的泛化效益，On the Generalization Benefit of Noise in Stochastic Gradient Descent

专知会员服务

19+阅读 · 2020年6月29日

【斯坦福大学】Gradient Surgery for Multi-Task Learning

【斯坦福大学】Gradient Surgery for Multi-Task Learning

专知会员服务

47+阅读 · 2020年1月23日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

2019年机器学习框架回顾

2019年机器学习框架回顾

专知会员服务

36+阅读 · 2019年10月11日

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

专知会员服务

83+阅读 · 2019年10月9日

meta learning 17年：MAML SNAIL

meta learning 17年：MAML SNAIL

CreateAMind

11+阅读 · 2019年1月2日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

【SIGIR2018】五篇对抗训练文章

【SIGIR2018】五篇对抗训练文章

专知

12+阅读 · 2018年7月9日

神经网络学习率设置

神经网络学习率设置

机器学习研究会

4+阅读 · 2018年3月3日

条件GAN重大改进！cGANs with Projection Discriminator

条件GAN重大改进！cGANs with Projection Discriminator

CreateAMind

8+阅读 · 2018年2月7日

资源｜斯坦福课程：深度学习理论！

资源｜斯坦福课程：深度学习理论！

全球人工智能

17+阅读 · 2017年11月9日

【推荐】斯坦福课程：深度学习理论（附视频+讲义+阅读材料）

【推荐】斯坦福课程：深度学习理论（附视频+讲义+阅读材料）

机器学习研究会

9+阅读 · 2017年11月8日

【学习】Hierarchical Softmax

【学习】Hierarchical Softmax

机器学习研究会

4+阅读 · 2017年8月6日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect

Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect

Arxiv

0+阅读 · 2021年10月7日

Hyperparameter Tuning with Renyi Differential Privacy

Arxiv

0+阅读 · 2021年10月7日

Fast and Robust Online Inference with Stochastic Gradient Descent via Random Scaling

Arxiv

0+阅读 · 2021年10月7日

On the Global Convergence of Gradient Descent for multi-layer ResNets in the mean-field regime

Arxiv

0+阅读 · 2021年10月6日

Linear Convergence of Generalized Mirror Descent with Time-Dependent Mirrors

Arxiv

0+阅读 · 2021年10月6日

Optimal rates of convergence and error localization of Gegenbauer projections

Arxiv

0+阅读 · 2021年10月6日

A Control-Theoretic Perspective on Optimal High-Order Optimization

Arxiv

0+阅读 · 2021年10月6日

The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study

The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study

Arxiv

4+阅读 · 2019年5月9日

Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

Arxiv

8+阅读 · 2018年11月21日

Accelerated Randomized Coordinate Descent Algorithms for Stochastic Optimization and Online Learning

Arxiv

9+阅读 · 2018年7月16日

VIP会员

文章信息

相关主题

随机梯度下降

相关VIP内容

INRIA最新「机器学习理论」新书，229页pdf原理性阐述机器学习

INRIA最新「机器学习理论」新书，229页pdf原理性阐述机器学习

专知会员服务

69+阅读 · 2021年3月27日

INRIA 最新《机器学习理论》课程笔记，176页pdf

专知会员服务

51+阅读 · 2020年12月14日

【Google】深度学习对抗鲁棒性，43页ppt

专知会员服务

45+阅读 · 2020年10月31日

不可错过！UIUC最新《统计强化学习》课程！

专知会员服务

53+阅读 · 2020年9月7日

【ICML2020】噪声在随机梯度下降中的泛化效益，On the Generalization Benefit of Noise in Stochastic Gradient Descent

【ICML2020】噪声在随机梯度下降中的泛化效益，On the Generalization Benefit of Noise in Stochastic Gradient Descent

专知会员服务

19+阅读 · 2020年6月29日

【斯坦福大学】Gradient Surgery for Multi-Task Learning

【斯坦福大学】Gradient Surgery for Multi-Task Learning

专知会员服务

47+阅读 · 2020年1月23日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

2019年机器学习框架回顾

2019年机器学习框架回顾

专知会员服务

36+阅读 · 2019年10月11日

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

专知会员服务

83+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

从社会学实验到行为仿真：理解基于Agent的观点动力学建模思维

中英文版《GPT-5 System Card速览》报告

ACL 2025 | 大模型结构化知识提示的泛化能力研究

【普林斯顿博士论文】大型模型的高效推理

相关资讯

meta learning 17年：MAML SNAIL

meta learning 17年：MAML SNAIL

CreateAMind

11+阅读 · 2019年1月2日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

【SIGIR2018】五篇对抗训练文章

【SIGIR2018】五篇对抗训练文章

专知

12+阅读 · 2018年7月9日

神经网络学习率设置

神经网络学习率设置

机器学习研究会

4+阅读 · 2018年3月3日

条件GAN重大改进！cGANs with Projection Discriminator

条件GAN重大改进！cGANs with Projection Discriminator

CreateAMind

8+阅读 · 2018年2月7日

资源｜斯坦福课程：深度学习理论！

资源｜斯坦福课程：深度学习理论！

全球人工智能

17+阅读 · 2017年11月9日

【推荐】斯坦福课程：深度学习理论（附视频+讲义+阅读材料）

【推荐】斯坦福课程：深度学习理论（附视频+讲义+阅读材料）

机器学习研究会

9+阅读 · 2017年11月8日

【学习】Hierarchical Softmax

【学习】Hierarchical Softmax

机器学习研究会

4+阅读 · 2017年8月6日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

相关论文

Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect

Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect

Arxiv

0+阅读 · 2021年10月7日

Hyperparameter Tuning with Renyi Differential Privacy

Arxiv

0+阅读 · 2021年10月7日

Fast and Robust Online Inference with Stochastic Gradient Descent via Random Scaling

Arxiv

0+阅读 · 2021年10月7日

On the Global Convergence of Gradient Descent for multi-layer ResNets in the mean-field regime

Arxiv

0+阅读 · 2021年10月6日

Linear Convergence of Generalized Mirror Descent with Time-Dependent Mirrors

Arxiv

0+阅读 · 2021年10月6日

Optimal rates of convergence and error localization of Gegenbauer projections

Arxiv

0+阅读 · 2021年10月6日

A Control-Theoretic Perspective on Optimal High-Order Optimization

Arxiv

0+阅读 · 2021年10月6日

The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study

The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study

Arxiv

4+阅读 · 2019年5月9日

Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

Arxiv

8+阅读 · 2018年11月21日

Accelerated Randomized Coordinate Descent Algorithms for Stochastic Optimization and Online Learning

Arxiv

9+阅读 · 2018年7月16日

微信扫码咨询专知VIP会员