Distributed deep learning workloads include throughput-intensive training tasks on GPU clusters, where distributed Stochastic Gradient Descent (SGD) incurs significant communication delays after backward propagation, forcing workers to wait for gradient synchronization via a centralized parameter server or directly among decentralized workers. We present CrossoverScheduler, an algorithm that enables the communication cycles of one distributed training application to be filled by other applications through pipelining communication and computation. With CrossoverScheduler, the running performance of distributed training can be significantly improved without sacrificing convergence rate or model accuracy. We achieve this by introducing Crossover Synchronization, which allows multiple distributed deep learning applications to time-share the same GPU alternately. A prototype of CrossoverScheduler is built and integrated with Horovod. Experiments on a variety of distributed tasks show that CrossoverScheduler achieves a 20% speedup for image classification tasks on the ImageNet dataset.
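The following is a minimal sketch of the crossover idea described above, assuming a simple two-job setup: while one job performs network-bound gradient synchronization, the other job's computation fills the otherwise idle GPU time. All names here (run_job, gpu_lock, compute_step, allreduce_step) are illustrative placeholders, not part of the CrossoverScheduler or Horovod API.

```python
import threading

gpu_lock = threading.Lock()          # only one job computes on the GPU at a time

def compute_step(job, step):
    with gpu_lock:                   # forward + backward pass occupies the GPU
        print(f"{job}: compute step {step}")

def allreduce_step(job, step):
    # Gradient synchronization is network-bound, so it runs outside the GPU
    # lock; the other job's compute_step can fill this idle GPU time.
    print(f"{job}: allreduce step {step}")

def run_job(job, steps=3):
    for step in range(steps):
        compute_step(job, step)      # compute phase (holds the GPU)
        allreduce_step(job, step)    # communication phase (releases the GPU)

threads = [threading.Thread(target=run_job, args=(name,)) for name in ("jobA", "jobB")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In this sketch the two jobs naturally alternate: whenever one job releases the GPU lock to synchronize gradients, the other job acquires it for its next compute step, which is the time-sharing behavior that Crossover Synchronization provides for real distributed training applications.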