To ensure the performance of online service systems, their status is closely monitored through a variety of software and system metrics. Performance anomalies are degradations in service performance (e.g., slow response). Existing methods for detecting anomalies in these metrics often lack interpretability, which is vital for engineers and analysts who must take remediation actions. Moreover, they cannot effectively adapt to ever-changing services in an online fashion. To address these limitations, we propose ADSketch, an interpretable and adaptive performance anomaly detection approach based on pattern sketching. ADSketch achieves interpretability by identifying groups of anomalous metric patterns, each representing a particular type of performance issue. The underlying issue can then be recognized immediately if a similar pattern emerges again. In addition, an adaptive learning algorithm is designed to incorporate unprecedented patterns induced by service updates or user behavior changes. The proposed approach is evaluated on public data as well as industrial data collected from a representative online service system in Huawei Cloud. The experimental results show that ADSketch outperforms state-of-the-art approaches by a significant margin and demonstrate the effectiveness of the online algorithm in discovering new patterns. Furthermore, our approach has been successfully deployed in industrial practice.
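As a rough illustration of the idea summarized above, and not the ADSketch algorithm itself (whose details the abstract does not give), the following toy sketch matches a metric window against a small store of labeled patterns and records a new pattern when nothing similar is known. The class name `PatternStore`, the mean-centering step, the Euclidean distance, and the fixed `threshold` are all assumptions made for this example.

```python
import math


def center(window):
    """Remove the window's mean so that only its shape is compared (an assumption)."""
    mean = sum(window) / len(window)
    return [x - mean for x in window]


def distance(a, b):
    """Euclidean distance between two equal-length shapes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class PatternStore:
    """Hypothetical store of labeled metric patterns; not the paper's data structure."""

    def __init__(self, threshold):
        self.threshold = threshold  # maximum distance that still counts as a match
        self.patterns = []          # list of (centered shape, label)

    def add(self, window, label):
        self.patterns.append((center(window), label))

    def classify(self, window):
        """Return the label of the nearest stored pattern within the threshold.

        If no stored pattern is close enough, the window is recorded as a new
        "unknown" pattern, which mimics adapting to unseen behavior.
        """
        shape = center(window)
        best_label, best_dist = None, float("inf")
        for ref, label in self.patterns:
            d = distance(shape, ref)
            if d < best_dist:
                best_label, best_dist = label, d
        if best_dist <= self.threshold:
            return best_label
        self.patterns.append((shape, "unknown"))
        return "unknown"


store = PatternStore(threshold=2.0)
store.add([10, 10, 10, 10], "normal")
store.add([20, 20, 40, 20], "latency spike")

print(store.classify([10, 11, 10, 11]))  # small jitter, close to the flat pattern
print(store.classify([10, 10, 30, 10]))  # same shape as the stored spike
print(store.classify([10, 20, 30, 40]))  # steady ramp, unlike either stored pattern
```

Because each match returns the label of a stored pattern, an engineer can see which known issue a window resembles, and a window that matches nothing becomes a candidate pattern for later labeling.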