Dynamic adaptation has become an essential technique in accelerating distributed machine learning (ML) training. Recent studies have shown that dynamically adjusting model structure (e.g., lottery ticket hypothesis) or hyperparameters (e.g., batch size) can significantly accelerate training without sacrificing accuracy. However, existing ML cluster schedulers are not designed to handle dynamic adaptation. We show that existing schemes fail to provide fairness and degrade system efficiency when the training throughput changes over time under dynamic adaptation. We design Shockwave, a scheduler with future planning that builds on two key ideas. First, Shockwave extends classic market theory from static settings to dynamic settings to co-optimize efficiency and fairness. Second, Shockwave utilizes stochastic dynamic programming to handle dynamic changes. We build a system for Shockwave and validate its performance with both trace-driven simulation and cluster experiments. Results show that for traces of ML jobs with dynamic adaptation, Shockwave improves makespan by 1.3X and fairness by 2X when compared with existing fair scheduling schemes.