With the increasing diversity of ML infrastructure, distributed training over heterogeneous computing systems is desirable for producing large models. Mixture-of-Experts (MoE) models have been proposed to lower training cost relative to the overall size of models and data through gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts to carry out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference can still be improved in several system aspects, including load balancing, communication/computation efficiency, and memory footprint. In this work, we present SE-MoE, which proposes Elastic MoE training with 2D prefetch and Fusion communication over Hierarchical storage, so as to enjoy efficient parallelism of various types. For scalable inference on a single node, especially when the model size is larger than GPU memory, SE-MoE forms the CPU and GPU memory jointly into a ring of sections to hold the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference. We carried out extensive experiments to evaluate SE-MoE, in which it successfully trains a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. The comparison against the state-of-the-art shows that SE-MoE outperformed DeepSpeed, with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. In particular, under unbalanced MoE tasks, e.g., UFO, SE-MoE achieved 64% higher throughput with an 18% lower memory footprint. The code of the framework will be released at https://github.com/PaddlePaddle/Paddle.
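To make the ring-of-sections idea concrete, the following is a minimal sketch, not the SE-MoE implementation: it illustrates how a model larger than GPU memory can be split into sections spread across CPU and GPU memory and executed in a round-robin order, so that only one section needs to reside on the GPU at a time. All names here (Section, load_to_gpu, offload_to_cpu, run, ring_inference) are hypothetical placeholders introduced for illustration.

    from dataclasses import dataclass
    from typing import List


    @dataclass
    class Section:
        """One contiguous slice of model weights plus its compute step."""
        name: str
        weights: object          # e.g. a tensor shard kept in pinned CPU memory
        on_gpu: bool = False

        def load_to_gpu(self):
            # In a real system this would be an asynchronous host-to-device copy.
            self.on_gpu = True

        def offload_to_cpu(self):
            # Free GPU memory so the next section in the ring can be loaded.
            self.on_gpu = False

        def run(self, activations):
            # Placeholder for the layer computation on `activations`.
            assert self.on_gpu, "section must be resident on GPU before compute"
            return activations  # identity stands in for the real forward pass


    def ring_inference(sections: List[Section], activations, num_steps: int):
        """Cycle through the ring of sections in round-robin order."""
        n = len(sections)
        for step in range(num_steps):
            section = sections[step % n]
            section.load_to_gpu()        # bring the needed slice onto the GPU
            activations = section.run(activations)
            section.offload_to_cpu()     # make room for the next slice in the ring
        return activations


    if __name__ == "__main__":
        ring = [Section(f"slice_{i}", weights=None) for i in range(4)]
        out = ring_inference(ring, activations=[0.0], num_steps=4)

In an actual system the host-to-device copies would overlap with computation (e.g. via separate streams and prefetching the next section), which is what makes the round-robin schedule efficient; the sketch omits that overlap for brevity.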