Deep neural network training jobs and other iterative computations frequently include checkpoints at which a job can be canceled based on the current values of monitored metrics. While most existing results focus on the performance of all jobs (both successfully completed and canceled), in this work we explore scheduling policies that improve the sojourn time of successful jobs, which are typically more valuable to the user. Our model assumes that each job has a known discrete size distribution (e.g., estimated from previous execution logs), where the largest size value indicates a successful completion and the other size values correspond to termination checkpoints. In the single-server case where all jobs are available for scheduling simultaneously, we prove that optimal schedules do not preempt jobs, even when the preemption overhead is negligible. Based on this, we develop a scheduling policy that minimizes the sojourn time of successful jobs asymptotically, i.e., as the number of jobs grows to infinity. Through an extensive numerical study, we show that this policy outperforms existing alternatives even when the number of jobs is finite. For more realistic scenarios with multiple servers and dynamic job arrivals, we propose an online approach based on our single-server scheduling policy. Through an extensive simulation study using real-world traces, we demonstrate that this online approach yields better average sojourn time for successful jobs than existing techniques.