Distributed synchronized GPU training is commonly used for deep learning. The resource constraint of training on a fixed set of GPUs hurts large-scale deep learning training jobs and also lowers cluster utilization. However, incorporating resource elasticity often introduces non-determinism in model accuracy, mainly because the model training procedure cannot be isolated from the underlying hardware resources. We introduce EasyScale, an elastic framework that scales distributed training on heterogeneous GPUs while producing deterministic deep learning models. EasyScale strictly follows the data-parallel training flow, carefully traces accuracy-relevant factors, and exploits deep learning characteristics for efficient context switching, thereby achieving elastic, accuracy-consistent model training. To saturate the compute capability of heterogeneous GPUs, EasyScale dynamically assigns workers according to our intra-job and inter-job scheduling policies, minimizing GPU idle time and maximizing aggregate job throughput. Deployed in an online serving cluster of CompanyA, EasyScale enables elastic deep learning training jobs to opportunistically utilize free GPUs, improving overall cluster utilization by 62.1% without violating the SLA.