Companies build separate training and inference GPU clusters for deep learning and use separate schedulers to manage them. This leads to problems on both sides: inference clusters suffer low GPU utilization when traffic is light, while training jobs often experience long queueing times due to a lack of resources. We introduce Aryl, a new cluster scheduler that addresses these problems. Aryl introduces capacity loaning, which lends idle inference GPU servers to training jobs. It further exploits elastic scaling, which adjusts a training job's GPU allocation to better utilize loaned resources. Capacity loaning and elastic scaling create new challenges for cluster management. When loaned servers must be returned, we need to minimize the number of job preemptions; when more GPUs become available, we need to allocate them to elastic jobs so as to minimize job completion time (JCT). Aryl addresses these combinatorial problems with principled heuristics. It introduces the notion of server preemption cost, which it greedily reduces during server reclaiming. It further relies on a JCT reduction value, defined for each additional worker of an elastic job, to solve the scheduling problem as a multiple-choice knapsack problem. A prototype implementation on a 64-GPU testbed and large-scale simulations with 15-day traces of over 50,000 production jobs show that Aryl achieves 1.53x and 1.50x reductions in average queueing time and JCT, and improves cluster usage by up to 26.9% over a cluster scheduler without capacity loaning or elastic scaling.
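The abstract names two heuristics without giving their details. Below is a minimal, self-contained sketch of how they could look, under stated assumptions: `schedule_elastic` treats the GPU-allocation step as a multiple-choice knapsack solved by dynamic programming over the idle-GPU budget, taking per-job (extra GPUs, JCT reduction) options as given inputs; `reclaim_servers` greedily picks the loaned servers with the lowest preemption cost, here assumed to be the number of training jobs that would be preempted. All function names, input shapes, and the cost definition are illustrative, not taken from the paper.

```python
def schedule_elastic(jobs, capacity):
    """Multiple-choice knapsack DP over a GPU budget (illustrative sketch).

    jobs: one option list per elastic job; each option is a pair
          (extra_gpus, jct_reduction). A zero-GPU option is implicit.
    capacity: number of idle GPUs available to hand out.
    Returns (total JCT reduction, extra GPUs granted to each job).
    """
    NEG = float("-inf")
    dp = [0.0] + [NEG] * capacity       # dp[c]: best value using c GPUs
    picks = []                          # picks[j][c]: option index, -1 = none
    for opts in jobs:
        ndp = [NEG] * (capacity + 1)
        pick = [-1] * (capacity + 1)
        for c in range(capacity + 1):
            if dp[c] == NEG:
                continue
            if dp[c] > ndp[c]:          # option: grant this job no extra GPUs
                ndp[c], pick[c] = dp[c], -1
            for i, (g, v) in enumerate(opts):
                if c + g <= capacity and dp[c] + v > ndp[c + g]:
                    ndp[c + g], pick[c + g] = dp[c] + v, i
        dp = ndp
        picks.append(pick)
    best_c = max(range(capacity + 1), key=lambda c: dp[c])
    alloc, c = [], best_c
    for j in range(len(jobs) - 1, -1, -1):  # backtrack the chosen options
        i = picks[j][c]
        g = 0 if i == -1 else jobs[j][i][0]
        alloc.append(g)
        c -= g
    alloc.reverse()
    return dp[best_c], alloc


def reclaim_servers(servers, n_needed):
    """Greedy server reclaiming (illustrative): return the n_needed loaned
    servers with the lowest preemption cost. Here each server is a pair
    (server_id, preemption_cost), with cost assumed to be the number of
    training jobs that would be preempted."""
    return sorted(servers, key=lambda s: s[1])[:n_needed]
```

For example, with two elastic jobs and two idle GPUs, where job 0 gains 5.0 (1 GPU) or 8.0 (2 GPUs) of JCT reduction and job 1 gains 4.0 (1 GPU), the DP grants one GPU to each job for a total reduction of 9.0 rather than giving both GPUs to job 0.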