1.B. 亚当:通信效率高的大型培训与亚当的聚合速度 (1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed) - 专知论文

会员服务 ·

0

Adam · 可约的 · 优化器 · state-of-the-art · SGD ·

2021 年 2 月 4 日

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

翻译：1.B. 亚当:通信效率高的大型培训与亚当的聚合速度

Hanlin Tang,Shaoduo Gan,Ammar Ahmad Awan,Samyam Rajbhandari,Conglong Li,Xiangru Lian,Ji Liu,Ce Zhang,Yuxiong He

from arxiv, arXiv admin note: text overlap with arXiv:2008.11343

Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective methods is error-compensated compression, which offers robust convergence speed even under 1-bit compression. However, state-of-the-art error compensation techniques only work with basic optimizers like SGD and momentum SGD, which are linearly dependent on the gradients. They do not work with non-linear gradient-based optimizers like Adam, which offer state-of-the-art convergence efficiency and accuracy for models like BERT. In this paper, we propose 1-bit Adam that reduces the communication volume by up to $5\times$, offers much better scalability, and provides the same convergence speed as uncompressed Adam. Our key finding is that Adam's variance (non-linear term) becomes stable (after a warmup phase) and can be used as a fixed precondition for the rest of the training (compression phase). Experiments on up to 256 GPUs show that 1-bit Adam enables up to $3.3\times$ higher throughput for BERT-Large pre-training and up to $2.9\times$ higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for our proposed work.

翻译：大型模型(如BERT和GPT-3)的可扩缩培训需要基于模型设计、架构和系统能力的审慎优化。从系统角度看,通信已经成为一个主要瓶颈,特别是在具有标准的TCP连接网络带带带有限带宽的商品系统上。通信压缩是缩短这类系统培训时间的重要技术。最有效的方法之一是校准错误压缩,它提供强力的趋同速度,即使是在1位压缩下也提供强力的趋同速度。然而,最先进的错误补偿技术只与基本优化者如SGD和势头SGD等依靠梯度线性调整的基本优化者一起工作。它们不与亚当等非线性梯度优化者一起工作,它们提供最先进的趋同效率和准确性。在本文件中,我们建议用1位亚当来将通信量减低至5美元,提供更好的缩放宽度,并且提供与不压成价的达标值的趋同速度。我们的主要发现是,Adam的变差(非线性任期)会稳定(经过一个暖阶段后)和精确的梯度优化精度优化优化优化优化优化优化化阶段。 3 将S25ADRADADA 用于一个固定阶段。

0

相关内容

Adam

神经网络序列数据建模，229页ppt，Modeling Sequential Data with Neural Nets

神经网络序列数据建模，229页ppt，Modeling Sequential Data with Neural Nets

专知会员服务

67+阅读 · 2020年7月25日

Fariz Darari简明《博弈论Game Theory》介绍，35页ppt

Fariz Darari简明《博弈论Game Theory》介绍，35页ppt

专知会员服务

112+阅读 · 2020年5月15日

《C++ Primer中文版第5版》电子书与学习笔记和课后练习答案

《C++ Primer中文版第5版》电子书与学习笔记和课后练习答案

专知会员服务

275+阅读 · 2020年2月13日

深度强化学习策略梯度教程，53页ppt

深度强化学习策略梯度教程，53页ppt

专知会员服务

184+阅读 · 2020年2月1日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【大规模数据系统，552页ppt】Large-scale Data Systems

【大规模数据系统，552页ppt】Large-scale Data Systems

专知会员服务

61+阅读 · 2019年12月21日

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

专知会员服务

28+阅读 · 2019年11月8日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

162+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

CCF推荐 | 国际会议信息10条

CCF推荐 | 国际会议信息10条

Call4Papers

8+阅读 · 2019年5月27日

CCF推荐 | 国际会议信息8条

CCF推荐 | 国际会议信息8条

Call4Papers

9+阅读 · 2019年5月23日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Call for Participation: Shared Tasks in NLPCC 2019

Call for Participation: Shared Tasks in NLPCC 2019

中国计算机学会

5+阅读 · 2019年3月22日

人工智能 | UAI 2019等国际会议信息4条

人工智能 | UAI 2019等国际会议信息4条

Call4Papers

6+阅读 · 2019年1月14日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

人工智能 | 国际会议信息6条

人工智能 | 国际会议信息6条

Call4Papers

5+阅读 · 2019年1月4日

人工智能 | 国际会议信息10条

人工智能 | 国际会议信息10条

Call4Papers

5+阅读 · 2018年12月18日

Hierarchical Imitation - Reinforcement Learning

Hierarchical Imitation - Reinforcement Learning

CreateAMind

19+阅读 · 2018年5月25日

【计算机类】期刊专刊/国际会议截稿信息6条

【计算机类】期刊专刊/国际会议截稿信息6条

Call4Papers

3+阅读 · 2017年10月13日

Parallel two-scale finite element implementation of a system with varying microstructures

Arxiv

0+阅读 · 2021年3月31日

Energy Efficient Edge Computing: When Lyapunov Meets Distributed Reinforcement Learning

Arxiv

0+阅读 · 2021年3月31日

BASE Layers: Simplifying Training of Large, Sparse Models

Arxiv

0+阅读 · 2021年3月30日

MergeComp: A Compression Scheduler for Scalable Communication-Efficient Distributed Training

Arxiv

0+阅读 · 2021年3月28日

Convergence Time Optimization for Federated Learning over Wireless Networks

Arxiv

0+阅读 · 2021年3月26日

Hierarchical Quantized Federated Learning: Convergence Analysis and System Design

Arxiv

0+阅读 · 2021年3月26日

Go Wide, Then Narrow: Efficient Training of Deep Thin Networks

Arxiv

15+阅读 · 2020年7月1日

Distributed Non-Convex Optimization with Sublinear Speedup under Intermittent Client Availability

Arxiv

11+阅读 · 2020年2月18日

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Arxiv

3+阅读 · 2019年9月25日

Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks

Arxiv

8+阅读 · 2019年5月20日

VIP会员

文章信息

相关主题

state-of-the-art

相关VIP内容

神经网络序列数据建模，229页ppt，Modeling Sequential Data with Neural Nets

神经网络序列数据建模，229页ppt，Modeling Sequential Data with Neural Nets

专知会员服务

67+阅读 · 2020年7月25日

Fariz Darari简明《博弈论Game Theory》介绍，35页ppt

Fariz Darari简明《博弈论Game Theory》介绍，35页ppt

专知会员服务

112+阅读 · 2020年5月15日

《C++ Primer中文版第5版》电子书与学习笔记和课后练习答案

《C++ Primer中文版第5版》电子书与学习笔记和课后练习答案

专知会员服务

275+阅读 · 2020年2月13日

深度强化学习策略梯度教程，53页ppt

深度强化学习策略梯度教程，53页ppt

专知会员服务

184+阅读 · 2020年2月1日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【大规模数据系统，552页ppt】Large-scale Data Systems

【大规模数据系统，552页ppt】Large-scale Data Systems

专知会员服务

61+阅读 · 2019年12月21日

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

专知会员服务

28+阅读 · 2019年11月8日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

162+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

热门VIP内容

开通专知VIP会员享更多权益服务

自动驾驶轨迹规划中的基础模型：进展综述与开放挑战

《用于提升多域战备的大型语言模型辅助场景生成器》报告

【斯坦福博士论文】为人类使用优化 AI 模型

国防领域人工智能规模化应用的理论与实践

相关资讯

CCF推荐 | 国际会议信息10条

CCF推荐 | 国际会议信息10条

Call4Papers

8+阅读 · 2019年5月27日

CCF推荐 | 国际会议信息8条

CCF推荐 | 国际会议信息8条

Call4Papers

9+阅读 · 2019年5月23日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Call for Participation: Shared Tasks in NLPCC 2019

Call for Participation: Shared Tasks in NLPCC 2019

中国计算机学会

5+阅读 · 2019年3月22日

人工智能 | UAI 2019等国际会议信息4条

人工智能 | UAI 2019等国际会议信息4条

Call4Papers

6+阅读 · 2019年1月14日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

人工智能 | 国际会议信息6条

人工智能 | 国际会议信息6条

Call4Papers

5+阅读 · 2019年1月4日

人工智能 | 国际会议信息10条

人工智能 | 国际会议信息10条

Call4Papers

5+阅读 · 2018年12月18日

Hierarchical Imitation - Reinforcement Learning

Hierarchical Imitation - Reinforcement Learning

CreateAMind

19+阅读 · 2018年5月25日

【计算机类】期刊专刊/国际会议截稿信息6条

【计算机类】期刊专刊/国际会议截稿信息6条

Call4Papers

3+阅读 · 2017年10月13日

相关论文

Parallel two-scale finite element implementation of a system with varying microstructures

Arxiv

0+阅读 · 2021年3月31日

Energy Efficient Edge Computing: When Lyapunov Meets Distributed Reinforcement Learning

Arxiv

0+阅读 · 2021年3月31日

BASE Layers: Simplifying Training of Large, Sparse Models

Arxiv

0+阅读 · 2021年3月30日

MergeComp: A Compression Scheduler for Scalable Communication-Efficient Distributed Training

Arxiv

0+阅读 · 2021年3月28日

Convergence Time Optimization for Federated Learning over Wireless Networks

Arxiv

0+阅读 · 2021年3月26日

Hierarchical Quantized Federated Learning: Convergence Analysis and System Design

Arxiv

0+阅读 · 2021年3月26日

Go Wide, Then Narrow: Efficient Training of Deep Thin Networks

Arxiv

15+阅读 · 2020年7月1日

Distributed Non-Convex Optimization with Sublinear Speedup under Intermittent Client Availability

Arxiv

11+阅读 · 2020年2月18日

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Arxiv

3+阅读 · 2019年9月25日

Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks

Arxiv

8+阅读 · 2019年5月20日

微信扫码咨询专知VIP会员