Modern GPU datacenters are critical to delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimizing resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of job features and user behaviors. We present a comprehensive study of the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover several interesting observations from the perspectives of clusters, jobs, and users, which can facilitate cluster system design. Second, we introduce a general-purpose framework that manages resources based on historical data. As case studies, we design a Quasi-Shortest-Service-First scheduling service, which reduces the cluster-wide average job completion time by up to 6.5x, and a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.
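To make the Quasi-Shortest-Service-First idea concrete, the following is a minimal, hypothetical Python sketch: pending jobs are enqueued with a priority equal to their predicted service time (estimated here from a user's historical run durations), so jobs expected to finish sooner are launched first, which is how shortest-service-style policies lower average job completion time. The function and field names (`estimate_service_time`, `submit`, `schedule`, `gpu_request`, `history`) are illustrative assumptions, not the paper's actual implementation or trace schema.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical pending-job record; only `priority` participates in ordering.
@dataclass(order=True)
class PendingJob:
    priority: float                      # predicted service time in seconds
    job_id: str = field(compare=False)
    gpu_request: int = field(compare=False)

def estimate_service_time(history, user, job_name, default=3600.0):
    """Predict a job's service time from the user's past runs.

    `history` is assumed to map (user, job_name) -> list of past durations
    in seconds; a simple mean stands in for whatever predictor the
    history-based framework actually uses.
    """
    past = history.get((user, job_name), [])
    return sum(past) / len(past) if past else default

def submit(queue, history, job_id, user, job_name, gpu_request):
    """Enqueue a job with priority = predicted service time (shorter first)."""
    predicted = estimate_service_time(history, user, job_name)
    heapq.heappush(queue, PendingJob(predicted, job_id, gpu_request))

def schedule(queue, free_gpus):
    """Launch the shortest-predicted jobs that fit into the free GPUs."""
    launched, skipped = [], []
    while queue and free_gpus > 0:
        job = heapq.heappop(queue)
        if job.gpu_request <= free_gpus:
            free_gpus -= job.gpu_request
            launched.append(job.job_id)
        else:
            skipped.append(job)          # does not fit now; re-queue it
    for job in skipped:
        heapq.heappush(queue, job)
    return launched, free_gpus

# Example usage with made-up history: the job with prior runs gets a short
# predicted service time and is scheduled ahead of the unknown job.
history = {("alice", "resnet50"): [1200.0, 1500.0]}
queue = []
submit(queue, history, "j1", "alice", "resnet50", gpu_request=8)
submit(queue, history, "j2", "bob", "bert-large", gpu_request=4)
launched, remaining = schedule(queue, free_gpus=8)
print(launched, remaining)               # ['j1'] 0
```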