Acela: Predictable Datacenter-level Maintenance Job Scheduling

Datacenter operators ensure fair and regular server maintenance by using automated processes to schedule maintenance jobs so that they complete within a strict time budget. Automating this scheduling problem is challenging because maintenance job duration varies with both job type and hardware. While it is tempting to use prior machine learning techniques for predicting job duration, we find that the structure of the maintenance job scheduling problem creates a unique challenge. In particular, we show that the prior machine learning methods that produce the lowest-error predictions do not produce the best scheduling outcomes, because prediction errors have asymmetric costs. Specifically, underpredicting maintenance job duration results in more servers being taken offline and longer server downtime than overpredicting it, so the system cost of underprediction is much larger than that of overprediction. We present Acela, a machine learning system for predicting maintenance job duration, which uses quantile regression to bias duration predictions toward overprediction. We integrate Acela into a maintenance job scheduler and evaluate it on datasets from large-scale production datacenters. Compared to machine-learning-based predictors from prior work, Acela reduces the number of servers that are taken offline by 1.87-4.28X and reduces server offline time by 1.40-2.80X.
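To make the asymmetric-cost argument concrete, the following minimal sketch shows how the quantile (pinball) loss behind quantile regression pushes a predictor toward overprediction. It does not reproduce Acela's model: the synthetic log-normal durations, the quantile level 0.9, the constant-prediction search, and all function names are assumptions chosen only for illustration.

```python
import random


def pinball_loss(y_true, y_pred, tau):
    """Average quantile (pinball) loss at quantile level tau in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Underprediction (diff > 0) is weighted by tau,
        # overprediction (diff < 0) by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def best_constant(y_true, tau, candidates):
    """Return the candidate constant prediction with the lowest pinball loss."""
    n = len(y_true)
    return min(candidates, key=lambda c: pinball_loss(y_true, [c] * n, tau))


def underprediction_rate(y_true, pred):
    """Fraction of jobs whose true duration exceeds the prediction."""
    return sum(1 for y in y_true if y > pred) / len(y_true)


random.seed(0)
# Hypothetical right-skewed job durations in minutes (synthetic, not from the paper).
durations = [random.lognormvariate(3.0, 0.5) for _ in range(2000)]
candidates = [c * 0.5 for c in range(1, 400)]  # 0.5, 1.0, ..., 199.5 minutes

# tau = 0.5 recovers a median-like prediction (symmetric error cost);
# tau = 0.9 (an assumed value) penalizes underprediction nine times more.
median_pred = best_constant(durations, 0.5, candidates)
high_pred = best_constant(durations, 0.9, candidates)

print("tau=0.5 prediction:", median_pred,
      "underprediction rate:", underprediction_rate(durations, median_pred))
print("tau=0.9 prediction:", high_pred,
      "underprediction rate:", underprediction_rate(durations, high_pred))
```

With a quantile level above 0.5, the loss-minimizing prediction moves to a higher duration, so fewer jobs overrun their predicted slot. A scheduler that consumes such predictions trades some unused slack for fewer overruns, which is the trade-off the abstract describes.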

