Large-scale interactive web services and advanced AI applications make sophisticated decisions in real time by executing massive numbers of computation tasks on thousands of servers. Task schedulers, which often operate in heterogeneous and volatile environments, require high throughput, i.e., scheduling millions of tasks per second, and low latency, i.e., incurring minimal scheduling delays for millisecond-level tasks. Scheduling is further complicated by other users' workloads in a shared system, other background activities, and the diverse hardware configurations inside datacenters. We present Rosella, a new self-driving, distributed approach for task scheduling in heterogeneous clusters. Rosella automatically learns the compute environment and adjusts its scheduling policy in real time. The solution provides high throughput and low latency simultaneously because it runs in parallel on multiple machines with minimal coordination and performs only simple operations for each scheduling decision. Our learning module monitors total system load and uses that information to dynamically determine the optimal strategy for estimating the backends' compute power. Rosella generalizes power-of-two-choices algorithms to handle heterogeneous workers, reducing the maximum queue length from the O(log n) achieved by prior algorithms to O(log log n). We evaluate Rosella with a variety of workloads on a 32-node AWS cluster. Experimental results show that Rosella significantly reduces task response time and quickly adapts to environment changes.
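To make the power-of-two-choices idea concrete, the sketch below simulates the classic policy extended to heterogeneous workers: sample two workers at random and assign the task to the one whose queue would drain sooner, weighting queue length by the worker's compute speed. This is a minimal illustration of the general technique, not Rosella's actual algorithm; the names `power_of_two_choices`, `queues`, and `speeds` are hypothetical.

```python
import random

def power_of_two_choices(queues, speeds, rng=random):
    """Sample two distinct workers and assign the task to the one with
    the smaller expected drain time (queue length / compute speed)."""
    i, j = rng.sample(range(len(queues)), 2)
    # Weighting by speed lets faster workers absorb proportionally
    # more tasks, which is the point of handling heterogeneity.
    chosen = i if queues[i] / speeds[i] <= queues[j] / speeds[j] else j
    queues[chosen] += 1
    return chosen

def simulate(n_workers=32, n_tasks=10_000, seed=0):
    """Schedule n_tasks onto workers with randomly chosen speeds."""
    rng = random.Random(seed)
    speeds = [rng.choice([1.0, 2.0, 4.0]) for _ in range(n_workers)]
    queues = [0] * n_workers
    for _ in range(n_tasks):
        power_of_two_choices(queues, speeds, rng)
    return queues, speeds
```

Sampling only two workers per decision keeps each scheduling operation O(1), which is what allows the abstract's combination of high throughput and low per-task latency; the theoretical payoff is that the maximum queue length stays near the average (O(log log n) above it) rather than drifting O(log n) away as under purely random assignment.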