To run a cloud application with the required service quality, operators have to continuously monitor the application's run-time status, detect potential performance anomalies, and diagnose their root causes. However, existing performance anomaly detection models often suffer from low reusability and robustness, owing to the diversity of the system-level metrics being monitored and the lack of high-quality labeled monitoring data for anomalies. Moreover, current coarse-grained analysis models make it difficult to locate the system-level root causes of application performance anomalies and thus to make effective adaptation decisions. We provide a FIne-grained Robust pErformance Diagnosis (FIRED) framework to tackle these challenges. The framework offers an ensemble of several well-selected base models for anomaly detection using a deep neural network, which adopts weakly supervised learning because few labels exist in practice. The framework also employs a real-time fine-grained analysis model to locate the system metrics that an anomaly depends on. Our experiments show that the framework can achieve the best detection accuracy and algorithm robustness, and that it can predict anomalies in four minutes with an F1 score higher than 0.8. In addition, the framework can accurately localize the first root cause, and it achieves an average accuracy higher than 0.7 in locating the first four root causes.
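To make the ensemble idea and the reported F1 metric concrete, the following is a minimal sketch and not the FIRED implementation. It assumes three hypothetical base detectors that each emit an anomaly score in [0, 1] per time step, and it combines them with a fixed weighted average in place of the deep neural network and weakly supervised training described above. The function names, scores, labels, and threshold are all illustrative assumptions.

```python
from typing import List, Sequence


def ensemble_scores(base_scores: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> List[float]:
    """Weighted average of per-model anomaly scores for each time step.

    Assumption: a fixed weighted average stands in for the learned
    (deep neural network) combiner mentioned in the abstract.
    """
    total = sum(weights)
    n_steps = len(base_scores[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, base_scores)) / total
        for i in range(n_steps)
    ]


def threshold(scores: Sequence[float], cutoff: float) -> List[int]:
    """Flag a time step as anomalous (1) when its score reaches the cutoff."""
    return [1 if s >= cutoff else 0 for s in scores]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical toy data: ground-truth labels and three base-model score series.
labels = [0, 0, 1, 1, 0, 1, 0, 0]
model_a = [0.1, 0.2, 0.9, 0.4, 0.3, 0.8, 0.1, 0.6]
model_b = [0.2, 0.1, 0.7, 0.8, 0.2, 0.9, 0.3, 0.2]
model_c = [0.0, 0.3, 0.8, 0.6, 0.6, 0.4, 0.2, 0.1]

combined = ensemble_scores([model_a, model_b, model_c], weights=[1.0, 1.0, 1.0])
ensemble_pred = threshold(combined, cutoff=0.5)
single_pred = threshold(model_a, cutoff=0.5)

ensemble_f1 = f1_score(labels, ensemble_pred)
single_f1 = f1_score(labels, single_pred)
print(f"ensemble F1 = {ensemble_f1:.2f}, model_a alone F1 = {single_f1:.2f}")
```

On this toy data each base model misclassifies at least one time step on its own, while the averaged score recovers every label. This illustrates why combining detectors can improve robustness; it says nothing about the results reported for FIRED.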