The scale of deep learning nowadays calls for efficient distributed training algorithms. Decentralized momentum SGD (DmSGD), in which each node averages only with its neighbors, is more communication-efficient than vanilla parallel momentum SGD, which incurs a global average across all computing nodes. On the other hand, large-batch training has been demonstrated to be critical for achieving runtime speedup. This motivates us to investigate how DmSGD performs in the large-batch scenario. In this work, we find that the momentum term can amplify the inconsistency bias in DmSGD. This bias becomes more evident as the batch size grows and hence results in severe performance degradation. We next propose DecentLaM, a novel decentralized large-batch momentum SGD algorithm that removes the momentum-incurred bias. We establish its convergence rate for both the non-convex and strongly-convex scenarios. Our theoretical results justify the superiority of DecentLaM over DmSGD, especially in the large-batch scenario. Experimental results on a variety of computer vision tasks and models demonstrate that DecentLaM provides both efficient and high-quality training.
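To illustrate the neighbor-only averaging that the abstract contrasts with a global average, the following is a minimal sketch of one synchronous decentralized momentum SGD step. The update form shown (local heavy-ball momentum followed by mixing with neighbors through a mixing matrix `W`) is an illustrative assumption, not the paper's exact DmSGD or DecentLaM recursion; the function and variable names are hypothetical.

```python
# Minimal sketch (assumed update form, not the paper's exact algorithm):
# each node applies a local momentum SGD update, then averages parameters
# only with its neighbors via a row-stochastic mixing matrix W, instead of
# performing a global average across all computing nodes.
import numpy as np

def decentralized_momentum_step(x, m, grads, W, lr=0.1, beta=0.9):
    """One synchronous step over all nodes.

    x, m  : (n_nodes, dim) per-node parameters and momentum buffers
    grads : (n_nodes, dim) stochastic gradients evaluated at each node
    W     : (n_nodes, n_nodes) mixing matrix; W[i, j] > 0 only if node j
            is a neighbor of node i (or j == i), and each row sums to 1
    """
    m_new = beta * m + grads          # local heavy-ball momentum update
    x_local = x - lr * m_new          # local SGD-with-momentum step
    x_new = W @ x_local               # average only with neighbors
    return x_new, m_new

# Toy usage: 4 nodes on a ring topology, 3-dimensional parameters.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
n_nodes, dim = 4, 3
x = np.random.randn(n_nodes, dim)
m = np.zeros((n_nodes, dim))
grads = np.random.randn(n_nodes, dim)   # stand-in for per-node minibatch gradients
x, m = decentralized_momentum_step(x, m, grads, W)
```

Because `W` is sparse in practice (each node communicates with only a few neighbors), each step costs far less communication than the all-reduce needed for a global average, which is the efficiency argument the abstract makes.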