Boosting Distributed Machine Learning Training Through Loss-tolerant Transmission Protocol


May 7, 2023


Zixuan Chen, Lei Shi, Xuandong Liu, Xin Ai, Sen Liu, Yang Xu

From arXiv. This paper will be published at IWQoS 2023. Preview version only.

Distributed Machine Learning (DML) systems are used to speed up model training in data centers (DCs) and edge nodes. The Parameter Server (PS) communication architecture is commonly employed, but it suffers severe long-tail latency caused by many-to-one "incast" traffic patterns, which degrades training throughput. To address this challenge, we design the Loss-tolerant Transmission Protocol (LTP), which permits partial loss of gradients during synchronization to avoid unnecessary retransmission and thereby speeds up synchronization in each iteration. LTP implements loss-tolerant transmission through out-of-order transmission and out-of-order acknowledgments (ACKs). LTP employs Early Close to adjust the loss-tolerant threshold based on network conditions and Bubble Filling for data correction to maintain training accuracy. LTP is implemented in C++ and integrated into PyTorch. Evaluations on a testbed of 8 worker nodes and one PS node demonstrate that LTP can improve DML training throughput by up to 30x compared to traditional TCP congestion controls, with no sacrifice in final accuracy.
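The idea of tolerating partial gradient loss at the parameter server can be illustrated with a small sketch. This is not the authors' implementation (LTP itself is written in C++), and the abstract does not specify how Early Close or Bubble Filling work internally. The sketch assumes, purely for illustration, that a gradient is split into fixed-size chunks, that the server may stop waiting once a threshold fraction of chunks has arrived, and that missing chunks are filled with zeros before averaging. All function names and parameters here are hypothetical.

```python
from typing import Dict, List


def can_close_early(received: int, expected: int, threshold: float) -> bool:
    """Assumed Early Close rule: stop waiting once enough chunks have arrived."""
    return received >= threshold * expected


def reassemble(chunks: Dict[int, List[float]], num_chunks: int,
               chunk_len: int) -> List[float]:
    """Rebuild a flat gradient from chunks keyed by index.

    Chunks may have arrived in any order. A missing chunk (a "bubble")
    is filled with zeros; zero-filling is an assumption of this sketch.
    """
    flat: List[float] = []
    for idx in range(num_chunks):
        flat.extend(chunks.get(idx, [0.0] * chunk_len))
    return flat


def aggregate(worker_grads: List[List[float]]) -> List[float]:
    """Average the per-worker gradients element by element."""
    n = len(worker_grads)
    return [sum(vals) / n for vals in zip(*worker_grads)]


# Worker 0 delivered both chunks (out of order); worker 1 lost chunk 0.
worker0 = {1: [3.0, 4.0], 0: [1.0, 2.0]}
worker1 = {1: [7.0, 8.0]}

received = len(worker0) + len(worker1)  # 3 of 4 chunks arrived
if can_close_early(received, expected=4, threshold=0.75):
    grads = [reassemble(w, num_chunks=2, chunk_len=2) for w in (worker0, worker1)]
    averaged = aggregate(grads)
    print(averaged)
```

In this toy setup the server proceeds with three of four chunks instead of waiting for a retransmission. How the real protocol chooses the threshold from network conditions, and how it corrects the filled data, is described in the paper rather than here.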
