Low-latency online services have strict Service Level Objectives (SLOs) that require datacenter systems to support high throughput at microsecond-scale tail latency. Dataplane operating systems have been designed to scale up multi-core servers with minimal overhead for such SLOs. However, as application demands continue to increase, scaling up is not enough, and serving larger demands requires these systems to scale out to multiple servers in a rack. We present RackSched, the first rack-level microsecond-scale scheduler that provides the abstraction of a rack-scale computer (i.e., a huge server with hundreds to thousands of cores) to an external service with network-system co-design. The core of RackSched is a two-layer scheduling framework that integrates inter-server scheduling in the top-of-rack (ToR) switch with intra-server scheduling in each server. We use a combination of analytical results and simulations to show that it provides near-optimal performance comparable to centralized scheduling policies, and is robust for both low-dispersion and high-dispersion workloads. We design a custom switch data plane for the inter-server scheduler, which realizes power-of-k-choices, ensures request affinity, and tracks server loads accurately and efficiently. We implement a RackSched prototype on a cluster of commodity servers connected by a Barefoot Tofino switch. End-to-end experiments on a twelve-server testbed show that RackSched improves the throughput by up to 1.44x, and scales out the throughput near linearly, while maintaining the same tail latency as one server until the system is saturated.
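The inter-server scheduler's power-of-k-choices policy mentioned above is a classic load-balancing technique: instead of tracking all servers, the dispatcher samples k servers at random and sends the request to the least loaded of the k. A minimal sketch follows; the load representation (per-server outstanding-request counts) and function names are illustrative assumptions, not RackSched's switch-dataplane implementation.

```python
import random

def power_of_k_choices(server_loads, k=2):
    """Dispatch one request via power-of-k-choices:
    sample k distinct servers uniformly at random and
    return the index of the least-loaded sampled server.

    server_loads: hypothetical per-server counts of
    outstanding requests (not RackSched's actual state)."""
    candidates = random.sample(range(len(server_loads)), k)
    return min(candidates, key=lambda s: server_loads[s])

# Example: dispatch a request across eight servers.
loads = [3, 7, 1, 5, 2, 9, 4, 6]
target = power_of_k_choices(loads, k=2)
loads[target] += 1  # the dispatched request raises that server's load
```

With k=2 this already avoids most of the imbalance of purely random dispatch while needing only two load lookups per request, which is what makes it attractive to realize in a switch data plane.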