Deep learning (DL) has flourished across a wide variety of fields, but developing a DL model remains a time-consuming and resource-intensive procedure. Consequently, dedicated GPU accelerators are aggregated into GPU datacenters, where an efficient scheduler design is crucial for reducing operational cost and improving resource utilization. However, traditional scheduling approaches designed for big data or high-performance computing workloads cannot enable DL workloads to fully utilize GPU resources. Recently, many schedulers tailored for DL workloads in GPU datacenters have been proposed. This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads in terms of their scheduling objectives and resource consumption features. Finally, we discuss several promising future research directions. A more detailed summary of the surveyed papers, with code links, can be found at our project website: https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers