The Convergence Rate of SGD's Final Iterate: Analysis on Dimension Dependence - 专知论文


SGD · Lipschitz · convex functions · Extensibility · scenarios

June 28, 2021

The Convergence Rate of SGD's Final Iterate: Analysis on Dimension Dependence


Daogao Liu, Zhou Lu

Stochastic Gradient Descent (SGD) is among the simplest and most popular methods in optimization. Its convergence rate has been studied extensively, and tight analyses have been established for the running-average scheme, but the sub-optimality of the final iterate is still not well understood. Shamir and Zhang (2013) gave the best known upper bound for the final iterate of SGD minimizing non-smooth convex functions: $O(\log T/\sqrt{T})$ for Lipschitz convex functions and $O(\log T/T)$ under the additional assumption of strong convexity. The best known lower bounds, however, are worse than the upper bounds by a factor of $\log T$. Harvey et al. (2019) gave matching lower bounds, but their construction requires dimension $d = T$. Koren and Segal (2020) then asked how to characterize the final-iterate convergence of SGD in the constant-dimension setting. In this paper, we answer this question in the more general setting of any $d \leq T$, proving $\Omega(\log d/\sqrt{T})$ and $\Omega(\log d/T)$ lower bounds for the sub-optimality of the final iterate of SGD in minimizing non-smooth Lipschitz convex and strongly convex functions, respectively, with standard step size schedules. Our results provide the first general dimension-dependent lower bound on the convergence of SGD's final iterate, partially resolving the COLT open question raised by Koren and Segal (2020). We also present further evidence that the correct rate in one dimension should be $\Theta(1/\sqrt{T})$, such as a proof of a tight $O(1/\sqrt{T})$ upper bound for one-dimensional special cases in settings more general than those of Koren and Segal (2020).
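To make the two quantities in the abstract concrete, the following is a minimal sketch that compares the final iterate with the running average on a toy one-dimensional problem. Everything in it is an assumption chosen for illustration and is not taken from the paper: the objective $f(x) = |x|$ on $[-1, 1]$, the step size $\eta_t = 1/\sqrt{t}$, the uniform noise model, and the function name `sgd_abs`. It does not reproduce the paper's lower-bound construction.

```python
import math
import random


def sgd_abs(T, x0=1.0, noise=0.5, seed=0):
    """Projected stochastic subgradient descent on f(x) = |x| over [-1, 1].

    Illustrative assumptions (not from the paper): step size
    eta_t = 1 / sqrt(t) and uniform noise in [-noise, noise] added to the
    subgradient. Returns (f(x_T), f(mean of x_1..x_T)).
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    x = x0
    running_sum = 0.0
    for t in range(1, T + 1):
        sub = (x > 0) - (x < 0)  # a subgradient of |x|; 0 is chosen at x = 0
        g = sub + rng.uniform(-noise, noise)  # unbiased noisy subgradient
        x = x - g / math.sqrt(t)  # SGD step with eta_t = 1 / sqrt(t)
        x = max(-1.0, min(1.0, x))  # project back onto [-1, 1]
        running_sum += x  # accumulate x_1, ..., x_T
    final_gap = abs(x)  # sub-optimality of the final iterate (min f = 0)
    average_gap = abs(running_sum / T)  # sub-optimality of the average
    return final_gap, average_gap


if __name__ == "__main__":
    for T in (100, 1_000, 10_000):
        final_gap, average_gap = sgd_abs(T)
        print(T, final_gap, average_gap, 1.0 / math.sqrt(T))
```

In this toy setting, both gaps can be compared against the printed $1/\sqrt{T}$ reference column as $T$ grows. The dimension-dependent behaviour analysed in the paper concerns worst-case functions, which a single easy instance like this one cannot exhibit.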

