This paper presents PREVENT, an approach for predicting and localizing failures in distributed enterprise applications by combining unsupervised techniques. Software failures can have dramatic consequences in production, and thus predicting and localizing failures is the essential step to activate healing measures that limit the disruptive consequences of failures. At the state of the art, many failures can be predicted from anomalous combinations of system metrics with respect to either rules provided from domain experts or supervised learning models. However, both these approaches limit the effectiveness of current techniques to well understood types of failures that can be either captured with predefined rules or observed while trining supervised models. PREVENT integrates the core ingredients of unsupervised approaches into a novel approach to predict failures and localize failing resources, without either requiring predefined rules or training with observed failures. The results of experimenting with PREVENT on a commercially-compliant distributed cloud system indicate that PREVENT provides more stable and reliable predictions, earlier than or comparably to supervised learning approaches, without requiring long and often impractical training with failures.
翻译:暂无翻译