A fundamental open problem in deep learning theory is how to define and understand the stability of stochastic gradient descent (SGD) near a fixed point. The conventional literature relies on the convergence of the statistical moments of the parameters, especially the variance, to quantify stability. We revisit the definition of stability for SGD and use the \textit{convergence in probability} condition to define the \textit{probabilistic stability} of SGD. The proposed notion of stability directly addresses a fundamental question in deep learning theory: how does SGD select a meaningful solution for a neural network from an enormous number of solutions that may overfit badly? To this end, we show that only under the lens of probabilistic stability does SGD exhibit rich and practically relevant phases of learning, such as complete loss of stability, incorrect learning, convergence to low-rank saddles, and correct learning. When applied to a neural network, these phase diagrams imply that SGD prefers low-rank saddles when the underlying gradient is noisy, thereby improving learning performance. This result stands in sharp contrast to the conventional wisdom that SGD prefers flat minima to sharp ones, which we find insufficient to explain the experimental data. We also prove that the probabilistic stability of SGD can be quantified by the Lyapunov exponents of the SGD dynamics, which can be measured easily in practice. Our work potentially opens a new avenue for addressing the fundamental question of how the learning algorithm affects the learning outcome in deep learning.
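For concreteness, the following is a minimal sketch of how the two central quantities could be formalized near a fixed point $\theta^\ast$, assuming a linearized SGD update $\theta_{t+1} - \theta^\ast = (I - \eta \hat{H}_t)(\theta_t - \theta^\ast)$ with learning rate $\eta$ and a stochastic Hessian estimate $\hat{H}_t$; the notation here is illustrative rather than the paper's exact statement:
\begin{align*}
    &\text{probabilistic stability:} && \theta_t \xrightarrow{\;P\;} \theta^\ast, \quad \text{i.e.,}\ \lim_{t\to\infty} \Pr\!\big(\|\theta_t - \theta^\ast\| > \epsilon\big) = 0 \ \text{for every } \epsilon > 0;\\
    &\text{Lyapunov exponent:} && \Lambda = \lim_{t\to\infty} \frac{1}{t}\, \mathbb{E}\!\left[\log \big\|(I - \eta \hat{H}_{t-1}) \cdots (I - \eta \hat{H}_0)\big\|\right],
\end{align*}
so that, roughly speaking and under suitable conditions, $\Lambda < 0$ corresponds to probabilistic stability of the fixed point and $\Lambda > 0$ to its loss.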