Companies build separate training and inference GPU clusters for deep learning and use separate schedulers to manage them. This leads to problems on both sides: inference clusters suffer low GPU utilization when traffic is light, while training jobs often experience long queueing times due to a lack of resources. We introduce Aryl, a new cluster scheduler that addresses these problems. Aryl introduces capacity loaning, which lends idle inference GPU servers to training jobs. It further exploits elastic scaling, which adjusts a training job's GPU allocation to better utilize loaned resources. Capacity loaning and elastic scaling create new challenges for cluster management. When loaned servers must be returned, we need to minimize the number of job preemptions; when more GPUs become available, we need to allocate them to elastic jobs so as to minimize job completion time (JCT). Aryl addresses these combinatorial problems with principled heuristics. It introduces the notion of server preemption cost, which it greedily reduces during server reclaiming. It further relies on a JCT reduction value, defined for each additional worker of an elastic job, to solve the scheduling problem as a multiple-choice knapsack problem. A prototype implementation on a 64-GPU testbed and large-scale simulations with 15-day traces of over 50,000 production jobs show that Aryl achieves 1.53x and 1.50x reductions in average queueing time and JCT, and improves cluster usage by up to 26.9% over a cluster scheduler without capacity loaning or elastic scaling.
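The abstract names two heuristics without giving their details. Below is a minimal, self-contained sketch of how they could look, under stated assumptions: `schedule_elastic` treats the GPU-allocation step as a multiple-choice knapsack solved by dynamic programming over the idle-GPU budget, taking per-job (extra GPUs, JCT reduction) options as given inputs; `reclaim_servers` greedily picks the loaned servers with the lowest preemption cost, here assumed to be the number of training jobs that would be preempted. All function names, input shapes, and the cost definition are illustrative, not taken from the paper.

```python
def schedule_elastic(jobs, capacity):
    """Multiple-choice knapsack DP over a GPU budget (illustrative sketch).

    jobs: one option list per elastic job; each option is a pair
          (extra_gpus, jct_reduction). A zero-GPU option is implicit.
    capacity: number of idle GPUs available to hand out.
    Returns (total JCT reduction, extra GPUs granted to each job).
    """
    NEG = float("-inf")
    dp = [0.0] + [NEG] * capacity       # dp[c]: best value using c GPUs
    picks = []                          # picks[j][c]: option index, -1 = none
    for opts in jobs:
        ndp = [NEG] * (capacity + 1)
        pick = [-1] * (capacity + 1)
        for c in range(capacity + 1):
            if dp[c] == NEG:
                continue
            if dp[c] > ndp[c]:          # option: grant this job no extra GPUs
                ndp[c], pick[c] = dp[c], -1
            for i, (g, v) in enumerate(opts):
                if c + g <= capacity and dp[c] + v > ndp[c + g]:
                    ndp[c + g], pick[c + g] = dp[c] + v, i
        dp = ndp
        picks.append(pick)
    best_c = max(range(capacity + 1), key=lambda c: dp[c])
    alloc, c = [], best_c
    for j in range(len(jobs) - 1, -1, -1):  # backtrack the chosen options
        i = picks[j][c]
        g = 0 if i == -1 else jobs[j][i][0]
        alloc.append(g)
        c -= g
    alloc.reverse()
    return dp[best_c], alloc


def reclaim_servers(servers, n_needed):
    """Greedy server reclaiming (illustrative): return the n_needed loaned
    servers with the lowest preemption cost. Here each server is a pair
    (server_id, preemption_cost), with cost assumed to be the number of
    training jobs that would be preempted."""
    return sorted(servers, key=lambda s: s[1])[:n_needed]
```

For example, with two elastic jobs and two idle GPUs, where job 0 gains 5.0 (1 GPU) or 8.0 (2 GPUs) of JCT reduction and job 1 gains 4.0 (1 GPU), the DP grants one GPU to each job for a total reduction of 9.0 rather than giving both GPUs to job 0.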