MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters - 专知 (Zhuanzhi) Papers



March 24, 2023

MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters


Yihao Zhao, Xin Liu, Shufan Liu, Xiang Li, Yibo Zhu, Gang Huang, Xuanzhe Liu, Xin Jin

Large-scale GPU clusters are widely used to speed up both latency-critical (online) and best-effort (offline) deep learning (DL) workloads. However, most DL clusters either dedicate each GPU to one workload or share workloads in time, leading to very low GPU resource utilization. We present MuxFlow, the first production cluster system that supports efficient and safe space-sharing for DL workloads. NVIDIA MPS provides an opportunity to share multiple workloads in space on widely deployed NVIDIA GPUs, but it cannot guarantee the performance and safety of online workloads. MuxFlow introduces a two-level protection mechanism for memory and computation to guarantee the performance of online workloads. Based on our practical error analysis, we design a mixed error-handling mechanism to guarantee the safety of online workloads. MuxFlow further proposes dynamic streaming multiprocessor (SM) allocation and matching-based scheduling to improve the efficiency of offline workloads. MuxFlow has been deployed at CompanyX's clusters with more than 20,000 GPUs. The deployment results indicate that MuxFlow substantially improves GPU utilization from 26% to 76%, SM activity from 16% to 33%, and GPU memory usage from 42% to 48%.
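The abstract does not spell out how matching-based scheduling works, so the following is only a minimal sketch of the general idea under stated assumptions: each offline job is paired with one GPU that already hosts an online workload, and the pairing is chosen to maximize a total score. The `score` matrix (a predicted normalized throughput for each job/GPU pair), the function name `match_offline_to_gpus`, and the brute-force search are all hypothetical illustrations, not MuxFlow's actual algorithm.

```python
from itertools import permutations


def match_offline_to_gpus(score):
    """Pick one distinct GPU per offline job so the summed score is maximal.

    score[i][j] is a hypothetical predicted normalized throughput of offline
    job i when it shares GPU j with that GPU's online workload.
    Assumes len(score) <= len(score[0]) (no more jobs than GPUs).
    Brute force over all assignments, so only suitable for tiny inputs.
    """
    n_jobs = len(score)
    n_gpus = len(score[0])
    best_total, best_assign = float("-inf"), None
    # Each permutation picks n_jobs distinct GPUs, one per job, in job order.
    for perm in permutations(range(n_gpus), n_jobs):
        total = sum(score[i][perm[i]] for i in range(n_jobs))
        if total > best_total:
            best_total, best_assign = total, perm
    return best_total, list(best_assign)


# Illustrative numbers only: 2 offline jobs, 3 shared GPUs.
score = [
    [0.9, 0.4, 0.3],
    [0.8, 0.7, 0.2],
]
total, assign = match_offline_to_gpus(score)
print(assign)  # assign[i] is the GPU index chosen for offline job i
```

A production scheduler would likely replace the brute-force loop with a polynomial-time assignment algorithm; the exhaustive search is used here only to keep the sketch dependency-free.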


