Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices - 专知 (Zhuanzhi) Papers

SGD · Networking · contrastive · Federated Learning · Neural Networks

November 8, 2021

Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices


Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk, Gennady Pekhimenko

From arXiv. Accepted to the Conference on Neural Information Processing Systems (NeurIPS) 2021. 50 pages, 6 figures. Code: https://github.com/yandex-research/moshpit-sgd

Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes. This approach, known as distributed training, can utilize hundreds of computers via specialized message-passing protocols such as Ring All-Reduce. However, running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters. In contrast, many real-world applications, such as federated learning and cloud-based distributed training, operate on unreliable devices with unstable network bandwidth. As a result, these applications are restricted to using parameter servers or gossip-based averaging protocols. In this work, we lift that restriction by proposing Moshpit All-Reduce - an iterative averaging protocol that exponentially converges to the global average. We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees. The experiments show 1.3x speedup for ResNet-50 training on ImageNet compared to competitive gossip-based strategies and 1.5x speedup when training ALBERT-large from scratch using preemptible compute nodes.
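The abstract does not spell out the protocol, so the following is only a minimal sketch of the general idea of iterative group averaging, not the authors' implementation. It assumes an idealized setting that the abstract does not state: peers sit on a virtual grid with `group_size ** dims` cells, no peer fails, and in each round the peers that differ only along one grid axis average their values. The function name `grid_group_averaging` and its parameters are hypothetical.

```python
import itertools


def grid_group_averaging(values, group_size, dims):
    """Average scalar values over a virtual grid, one axis per round.

    Assumption (illustrative only): exactly group_size ** dims peers,
    all reliable. In round `axis`, peers whose coordinates agree on
    every other axis form a group and adopt the group mean.
    """
    coords = list(itertools.product(range(group_size), repeat=dims))
    if len(values) != len(coords):
        raise ValueError("need exactly group_size ** dims values")
    state = dict(zip(coords, values))

    for axis in range(dims):
        # Group peers by their coordinates with the current axis removed.
        groups = {}
        for c in coords:
            key = c[:axis] + c[axis + 1:]
            groups.setdefault(key, []).append(c)
        # Each group replaces its members' values with the group mean.
        for members in groups.values():
            mean = sum(state[c] for c in members) / len(members)
            for c in members:
                state[c] = mean

    return [state[c] for c in coords]


if __name__ == "__main__":
    peers = [float(i) for i in range(27)]  # 3 x 3 x 3 grid of peers
    result = grid_group_averaging(peers, group_size=3, dims=3)
    print(result[0], sum(peers) / len(peers))
```

In this idealized grid every peer holds the global mean after `dims` rounds, while each peer only ever communicates with `group_size - 1` others per round. With peer failures or uneven groups the result would only approach the mean over repeated rounds, which is the regime the abstract's convergence claim addresses.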

