Low-latency online services have strict Service Level Objectives (SLOs) that require datacenter systems to support high throughput at microsecond-scale tail latency. Dataplane operating systems have been designed to scale up multi-core servers with minimal overhead for such SLOs. However, as application demands continue to increase, scaling up is not enough, and serving larger demands requires these systems to scale out to multiple servers in a rack. We present RackSched, the first rack-level microsecond-scale scheduler that provides the abstraction of a rack-scale computer (i.e., a huge server with hundreds to thousands of cores) to an external service with network-system co-design. The core of RackSched is a two-layer scheduling framework that integrates inter-server scheduling in the top-of-rack (ToR) switch with intra-server scheduling in each server. We use a combination of analytical results and simulations to show that it provides near-optimal performance comparable to centralized scheduling policies, and is robust for both low-dispersion and high-dispersion workloads. We design a custom switch data plane for the inter-server scheduler, which realizes power-of-k-choices, ensures request affinity, and tracks server loads accurately and efficiently. We implement a RackSched prototype on a cluster of commodity servers connected by a Barefoot Tofino switch. End-to-end experiments on a twelve-server testbed show that RackSched improves the throughput by up to 1.44x, and scales out the throughput near linearly, while maintaining the same tail latency as one server until the system is saturated.
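The inter-server scheduler's power-of-k-choices policy mentioned above is a classic load-balancing technique: instead of tracking all servers, the dispatcher samples k servers at random and sends the request to the least loaded of the k. A minimal sketch follows; the load representation (per-server outstanding-request counts) and function names are illustrative assumptions, not RackSched's switch-dataplane implementation.

```python
import random

def power_of_k_choices(server_loads, k=2):
    """Dispatch one request via power-of-k-choices:
    sample k distinct servers uniformly at random and
    return the index of the least-loaded sampled server.

    server_loads: hypothetical per-server counts of
    outstanding requests (not RackSched's actual state)."""
    candidates = random.sample(range(len(server_loads)), k)
    return min(candidates, key=lambda s: server_loads[s])

# Example: dispatch a request across eight servers.
loads = [3, 7, 1, 5, 2, 9, 4, 6]
target = power_of_k_choices(loads, k=2)
loads[target] += 1  # the dispatched request raises that server's load
```

With k=2 this already avoids most of the imbalance of purely random dispatch while needing only two load lookups per request, which is what makes it attractive to realize in a switch data plane.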