Cloud training platforms, such as Amazon Web Services and Huawei Cloud, provide users with computational resources to train their deep learning jobs. Elastic training is a service embedded in cloud training platforms that dynamically scales up or down the resources allocated to a job. The core task of an elastic training system is to allocate limited resources among heterogeneous jobs so as to shorten queueing delay and raise training efficiency. This paper presents an optimal resource allocator for elastic training systems that leverages a mixed-integer programming (MIP) model to maximize the training progress of deep learning jobs. Using real-world job data obtained from ModelArts, the deep learning training platform of Huawei Cloud, we conduct simulation experiments to compare the optimal resource allocator against a greedy allocator as a benchmark. Numerical results show that the proposed allocator can reduce queueing time by up to 32% and improve training efficiency by up to 24% relative to the greedy resource allocator, which can greatly improve the user experience of Huawei ModelArts and potentially increase the product's profit. The optimal resource allocator is also fast in decision-making, taking only 0.4 seconds on average.
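To make the contrast between a greedy and an optimal allocator concrete, the following is a minimal sketch and not the MIP model of this paper. It assumes a hypothetical table `SPEED` of training progress per GPU count for three jobs and a hypothetical cluster size `TOTAL_GPUS`; the greedy rule (serve jobs in arrival order, up to their maximum) is likewise an assumption about the benchmark. Exhaustive enumeration stands in for a MIP solver, which is only practical at this toy scale.

```python
from itertools import product

# Hypothetical throughput tables (illustrative numbers, not from the paper):
# SPEED[j][g] is the training progress per unit time of job j on g GPUs.
# Index 0 means the job receives nothing and stays in the queue.
SPEED = [
    [0.0, 1.0, 1.8, 2.4, 2.8],  # job 0: scales reasonably well
    [0.0, 1.0, 1.3, 1.4, 1.4],  # job 1: saturates quickly
    [0.0, 1.0, 1.9, 2.7, 3.4],  # job 2: scales almost linearly
]
TOTAL_GPUS = 6  # hypothetical cluster capacity


def total_progress(speed, alloc):
    """Sum of per-job progress under a given allocation."""
    return sum(table[g] for table, g in zip(speed, alloc))


def greedy_allocate(speed, total):
    """Serve jobs in arrival order, giving each as many GPUs as it accepts."""
    alloc, free = [], total
    for table in speed:
        g = min(len(table) - 1, free)
        alloc.append(g)
        free -= g
    return alloc


def optimal_allocate(speed, total):
    """Enumerate every feasible integer allocation and keep the best one.

    This brute-force search is a stand-in for solving an integer program:
    maximize total progress subject to the capacity constraint.
    """
    best_alloc, best_value = None, float("-inf")
    for alloc in product(*(range(len(table)) for table in speed)):
        if sum(alloc) > total:
            continue  # violates the capacity constraint
        value = total_progress(speed, alloc)
        if value > best_value:
            best_alloc, best_value = list(alloc), value
    return best_alloc


if __name__ == "__main__":
    for name, fn in (("greedy", greedy_allocate), ("optimal", optimal_allocate)):
        alloc = fn(SPEED, TOTAL_GPUS)
        print(name, alloc, round(total_progress(SPEED, alloc), 2))
```

In this toy instance the greedy rule leaves the last job queued, while the enumerated optimum starts all three jobs and achieves higher total progress. That is the kind of queueing and efficiency gap the abstract reports, although the numbers here are purely illustrative.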