Failed workloads that consume significant computational resources, in both time and space, substantially reduce the efficiency of data centers and thus limit the amount of scientific work that can be accomplished. While computational power has increased significantly over the years, the detection and prediction of workload failures have lagged far behind, and they will become increasingly critical as system scale and complexity continue to grow. In this study, we analyze workload traces collected from a production cluster and train machine learning models on a large number of datasets to predict workload failures. Our prediction models consist of a queue-time model, which estimates the probability of workload failure before execution, and a runtime model, which predicts failures during execution. Evaluation results show that the queue-time model and the runtime model predict workload failures with maximum precision scores of 90.61% and 97.75%, respectively. Integrating the runtime model with the job scheduler helps reduce CPU time and memory usage by up to 16.7% and 14.53%, respectively.
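Because the results above are reported as precision, a small illustration of that metric may help. The sketch below is not the authors' implementation: the function names `flag_failures` and `precision`, the 0.5 threshold, and the toy data are all assumptions made for illustration. It shows how predicted failure probabilities could be turned into failure flags and how precision, the fraction of flagged workloads that actually failed, is computed from them.

```python
from typing import List, Sequence


def flag_failures(failure_probs: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Flag a workload as a predicted failure when its probability meets the threshold.

    The threshold value is a hypothetical choice, not one taken from the study.
    """
    return [p >= threshold for p in failure_probs]


def precision(actual_failed: Sequence[bool], predicted_failed: Sequence[bool]) -> float:
    """Return TP / (TP + FP): the share of predicted failures that really failed."""
    if len(actual_failed) != len(predicted_failed):
        raise ValueError("inputs must have the same length")
    true_pos = sum(1 for a, p in zip(actual_failed, predicted_failed) if p and a)
    false_pos = sum(1 for a, p in zip(actual_failed, predicted_failed) if p and not a)
    if true_pos + false_pos == 0:
        return 0.0  # no workload was flagged, so precision is reported as 0 here
    return true_pos / (true_pos + false_pos)


# Toy data (invented): five workloads, three of which actually failed.
actual = [True, False, True, True, False]
probs = [0.9, 0.7, 0.8, 0.2, 0.1]

predicted = flag_failures(probs)
print(f"precision = {precision(actual, predicted):.4f}")
```

In this toy case three workloads are flagged and two of them actually failed, so precision is 2/3. High precision matters in a scheduler setting because every false positive corresponds to a healthy workload that would be wrongly treated as failing.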