以计算机和通信任务的智能平行方式加速分配 K-FAC (Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks) - 专知论文

会员服务 ·

0

簇 · Extensibility · 近似 · MoDELS · SGD ·

2021 年 7 月 14 日

Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks

翻译：以计算机和通信任务的智能平行方式加速分配 K-FAC

Shaohuai Shi,Lin Zhang,Bo Li

from arxiv, 11 pages. Accepted to IEEE ICDCS 2021

Distributed training with synchronous stochastic gradient descent (SGD) on GPU clusters has been widely used to accelerate the training process of deep models. However, SGD only utilizes the first-order gradient in model parameter updates, which may take days or weeks. Recent studies have successfully exploited approximate second-order information to speed up the training process, in which the Kronecker-Factored Approximate Curvature (KFAC) emerges as one of the most efficient approximation algorithms for training deep models. Yet, when leveraging GPU clusters to train models with distributed KFAC (D-KFAC), it incurs extensive computation as well as introduces extra communications during each iteration. In this work, we propose D-KFAC (SPD-KFAC) with smart parallelism of computing and communication tasks to reduce the iteration time. Specifically, 1) we first characterize the performance bottlenecks of D-KFAC, 2) we design and implement a pipelining mechanism for Kronecker factors computation and communication with dynamic tensor fusion, and 3) we develop a load balancing placement for inverting multiple matrices on GPU clusters. We conduct real-world experiments on a 64-GPU cluster with 100Gb/s InfiniBand interconnect. Experimental results show that our proposed SPD-KFAC training scheme can achieve 10%-35% improvement over state-of-the-art algorithms.

翻译：与同步随机梯度梯度下降(SGD)在 GPU 群集上进行分布式 KFAC (D-KFAC) 的分散培训已被广泛用于加速深层模型的培训过程。然而, SGD 只在模型参数更新中使用第一阶梯度, 可能需要数日或数周时间。最近的研究成功地利用了近似第二阶信息来加快培训过程, Kronecker- 受控的KFAC (KFAC) 是培训深层模型最高效的近距离算法之一。然而, 当利用 GPU 群来利用分布式 KFAC (D- KFAC) 来培训模型时, 它需要大量计算并在每次迭代期间引入额外的通信。在这项工作中,我们提议D- KFAC (SPD-K-KFAC) 与智能的计算和通信任务平行, 以缩短循环时间。具体地说, 1,我们首先将 D- KFAC 的性能瓶颈描述为D- 和动态州调调调和 3) 我们开发了一种负负负负平衡配置, 在 GPU 10- PLACT 上显示我们G IM 的G 10- 机组列的模型的模型的模型上。

0

相关内容

ICML2021接受论文列表出炉！1184篇论文都在这了！

专知会员服务

92+阅读 · 2021年6月3日

Python分布式计算，171页pdf，Distributed Computing with Python

Python分布式计算，171页pdf，Distributed Computing with Python

专知会员服务

108+阅读 · 2020年5月3日

【百度】-大规模深度学习广告系统的分布式分层GPU参数服务器，Distributed Hierarchical GPU PS

专知会员服务

24+阅读 · 2020年3月15日

【ICML2020提交论文】Learning@home:众包与分散Mixture-of-Experts训练的神经网络（Learning@home: Crowdsourced Training of Large Neural Networks with Decentralized Mixture-of-Experts）

【ICML2020提交论文】Learning@home:众包与分散Mixture-of-Experts训练的神经网络（Learning@home: Crowdsourced Training of Large Neural Networks with Decentralized Mixture-of-Experts）

专知会员服务

10+阅读 · 2020年2月12日

【Google 76分钟训练万BERT最新论文】Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

【Google 76分钟训练万BERT最新论文】Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

专知会员服务

4+阅读 · 2020年1月7日

如何加速NVIDIA gpu上的训练、推理和ML应用？108页ppt，Accelerating training, inference, and ML applications on NVIDIA GPUs

如何加速NVIDIA gpu上的训练、推理和ML应用？108页ppt，Accelerating training, inference, and ML applications on NVIDIA GPUs

专知会员服务

61+阅读 · 2019年12月29日

【文献综述】分布式机器学习综述论文，33页pdf，A Survey on Distributed Machine Learning

【文献综述】分布式机器学习综述论文，33页pdf，A Survey on Distributed Machine Learning

专知会员服务

124+阅读 · 2019年12月23日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

最新BERT相关论文清单，BERT-related Papers

最新BERT相关论文清单，BERT-related Papers

专知会员服务

53+阅读 · 2019年9月29日

已删除

将门创投

4+阅读 · 2019年6月5日

On Distributed Learning with Constant Communication Bits

Arxiv

0+阅读 · 2021年9月14日

Runtime Analysis of Single- and Multi-Objective Evolutionary Algorithms for Chance Constrained Optimization Problems with Normally Distributed Random Variables

Arxiv

1+阅读 · 2021年9月13日

H2Opus: A distributed-memory multi-GPU software package for non-local operators

Arxiv

0+阅读 · 2021年9月12日

Toward Communication Efficient Adaptive Gradient Method

Arxiv

0+阅读 · 2021年9月10日

ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table

Arxiv

7+阅读 · 2021年4月17日

A Survey on Distributed Machine Learning

Arxiv

45+阅读 · 2019年12月20日

Distributed Machine Learning on Mobile Devices: A Survey

Distributed Machine Learning on Mobile Devices: A Survey

Arxiv

37+阅读 · 2019年9月18日

Attributed Network Embedding via Subspace Discovery

Arxiv

4+阅读 · 2019年1月14日

Accelerated Methods for Deep Reinforcement Learning

Accelerated Methods for Deep Reinforcement Learning

Arxiv

6+阅读 · 2019年1月10日

GPU-Accelerated Robotic Simulation for Distributed Reinforcement Learning

GPU-Accelerated Robotic Simulation for Distributed Reinforcement Learning

Arxiv

4+阅读 · 2018年10月24日

VIP会员

文章信息

相关主题

相关VIP内容

ICML2021接受论文列表出炉！1184篇论文都在这了！

专知会员服务

92+阅读 · 2021年6月3日

Python分布式计算，171页pdf，Distributed Computing with Python

Python分布式计算，171页pdf，Distributed Computing with Python

专知会员服务

108+阅读 · 2020年5月3日

【百度】-大规模深度学习广告系统的分布式分层GPU参数服务器，Distributed Hierarchical GPU PS

专知会员服务

24+阅读 · 2020年3月15日

【ICML2020提交论文】Learning@home:众包与分散Mixture-of-Experts训练的神经网络（Learning@home: Crowdsourced Training of Large Neural Networks with Decentralized Mixture-of-Experts）

【ICML2020提交论文】Learning@home:众包与分散Mixture-of-Experts训练的神经网络（Learning@home: Crowdsourced Training of Large Neural Networks with Decentralized Mixture-of-Experts）

专知会员服务

10+阅读 · 2020年2月12日

【Google 76分钟训练万BERT最新论文】Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

【Google 76分钟训练万BERT最新论文】Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

专知会员服务

4+阅读 · 2020年1月7日

如何加速NVIDIA gpu上的训练、推理和ML应用？108页ppt，Accelerating training, inference, and ML applications on NVIDIA GPUs

如何加速NVIDIA gpu上的训练、推理和ML应用？108页ppt，Accelerating training, inference, and ML applications on NVIDIA GPUs

专知会员服务

61+阅读 · 2019年12月29日

【文献综述】分布式机器学习综述论文，33页pdf，A Survey on Distributed Machine Learning

【文献综述】分布式机器学习综述论文，33页pdf，A Survey on Distributed Machine Learning

专知会员服务

124+阅读 · 2019年12月23日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

最新BERT相关论文清单，BERT-related Papers

最新BERT相关论文清单，BERT-related Papers

专知会员服务

53+阅读 · 2019年9月29日

热门VIP内容

开通专知VIP会员享更多权益服务

大语言模型基准综述

《自适应训练辅助系统概念导论及其在空战指挥官加速培训中的应用》125页

【剑桥博士论文】多智能体学习中的神经多样性

以色列-伊朗空战：短暂而激烈冲突的启示

相关资讯

已删除

将门创投

4+阅读 · 2019年6月5日

相关论文

On Distributed Learning with Constant Communication Bits

Arxiv

0+阅读 · 2021年9月14日

Runtime Analysis of Single- and Multi-Objective Evolutionary Algorithms for Chance Constrained Optimization Problems with Normally Distributed Random Variables

Arxiv

1+阅读 · 2021年9月13日

H2Opus: A distributed-memory multi-GPU software package for non-local operators

Arxiv

0+阅读 · 2021年9月12日

Toward Communication Efficient Adaptive Gradient Method

Arxiv

0+阅读 · 2021年9月10日

ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table

Arxiv

7+阅读 · 2021年4月17日

A Survey on Distributed Machine Learning

Arxiv

45+阅读 · 2019年12月20日

Distributed Machine Learning on Mobile Devices: A Survey

Distributed Machine Learning on Mobile Devices: A Survey

Arxiv

37+阅读 · 2019年9月18日

Attributed Network Embedding via Subspace Discovery

Arxiv

4+阅读 · 2019年1月14日

Accelerated Methods for Deep Reinforcement Learning

Accelerated Methods for Deep Reinforcement Learning

Arxiv

6+阅读 · 2019年1月10日

GPU-Accelerated Robotic Simulation for Distributed Reinforcement Learning

GPU-Accelerated Robotic Simulation for Distributed Reinforcement Learning

Arxiv

4+阅读 · 2018年10月24日

微信扫码咨询专知VIP会员