DLRover: An Elastic Deep Training Extension with Auto Job Resource Recommendation

The cloud remains a popular platform for distributed deep learning (DL) training jobs because resource sharing in the cloud can improve resource utilization and reduce overall costs. However, such sharing also brings multiple challenges for DL training jobs; e.g., high-priority jobs can affect, or even interrupt, low-priority jobs. Meanwhile, most existing distributed DL training systems require users to configure a job's resources (i.e., the number of nodes and the resources, such as CPU and memory, allocated to each node) manually before submission and cannot adjust them at runtime. A job's resource configuration strongly affects its performance (e.g., training throughput, resource utilization, and completion rate). Manual configuration therefore usually leads to poor job performance, since users fail to provide an optimal resource configuration in most cases. \system~is a distributed DL framework that can automatically configure a DL job's initial resources and dynamically tune them to achieve better performance. With its elastic capability, \system~can effectively adjust the resources of a job when performance issues are detected or when the job fails because of faults or eviction. Evaluation results show that \system~can outperform well-tuned manual resource configurations. Furthermore, in the production Kubernetes cluster of \company, \system~reduces the median job completion time by 31\% and improves the job completion rate by 6\%, CPU utilization by 15\%, and memory utilization by 20\% compared with manual configuration.
