This paper presents PREVENT, an approach for predicting and localizing failures in distributed enterprise applications by combining unsupervised techniques. Software failures can have dramatic consequences in production, so predicting and localizing failures is an essential step in activating healing measures that limit their disruptive consequences. In the state of the art, many failures can be predicted from anomalous combinations of system metrics, identified with respect to either rules provided by domain experts or supervised learning models. However, both approaches limit the effectiveness of current techniques to well-understood types of failures that can be either captured with predefined rules or observed while training supervised models. PREVENT integrates the core ingredients of unsupervised approaches into a novel approach that predicts failures and localizes failing resources without requiring either predefined rules or training with observed failures. The results of experimenting with PREVENT on a commercially-compliant distributed cloud system indicate that PREVENT provides more stable and reliable predictions, earlier than or comparably to supervised learning approaches, without requiring long and often impractical training with failures.