Datacenter operators ensure fair and regular server maintenance by using automated processes to schedule maintenance jobs so that they complete within a strict time budget. Automating this scheduling problem is challenging because maintenance job duration varies with both job type and hardware. While it is tempting to use prior machine learning techniques for predicting job duration, we find that the structure of the maintenance job scheduling problem creates a unique challenge. In particular, we show that the prior machine learning methods that produce the lowest-error predictions do not produce the best scheduling outcomes, because prediction errors have asymmetric costs. Specifically, underpredicting maintenance job duration results in more servers being taken offline and longer server downtime than overpredicting it, so the system cost of underprediction is much larger than that of overprediction. We present Acela, a machine learning system for predicting maintenance job duration, which uses quantile regression to bias duration predictions toward overprediction. We integrate Acela into a maintenance job scheduler and evaluate it on datasets from large-scale production datacenters. Compared to machine-learning-based predictors from prior work, Acela reduces the number of servers that are taken offline by 1.87-4.28X and reduces server offline time by 1.40-2.80X.
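To illustrate how quantile regression can bias predictions toward overprediction, the following is a minimal sketch of the standard quantile (pinball) loss, not Acela's actual implementation. The function names, the quantile level `tau = 0.9`, and the sample durations are assumptions chosen purely for illustration. With `tau` above 0.5, each unit of underprediction is penalized more heavily than a unit of overprediction, so the loss-minimizing prediction moves above the median.

```python
import numpy as np


def pinball_loss(y_true, y_pred, tau):
    """Mean quantile (pinball) loss at quantile level tau in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # diff > 0: underprediction, weighted by tau.
    # diff < 0: overprediction, weighted by (1 - tau).
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))


def best_constant_prediction(durations, tau):
    """The constant that minimizes the pinball loss is the empirical tau-quantile."""
    return float(np.quantile(durations, tau))


def underprediction_rate(y_true, y_pred):
    """Fraction of jobs whose actual duration exceeds the prediction."""
    return float(np.mean(np.asarray(y_true, dtype=float) > y_pred))


# Hypothetical job durations in minutes (illustrative only, not from the paper).
durations = np.array([30, 35, 40, 45, 50, 60, 70, 90, 120, 200], dtype=float)

median_pred = best_constant_prediction(durations, tau=0.5)
high_pred = best_constant_prediction(durations, tau=0.9)

print("median prediction:", median_pred,
      "underprediction rate:", underprediction_rate(durations, median_pred))
print("tau=0.9 prediction:", high_pred,
      "underprediction rate:", underprediction_rate(durations, high_pred))
```

In a learned predictor, the same loss would replace squared or absolute error during training, so that the model estimates a high conditional quantile of job duration rather than its mean or median.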