Alerts are crucial for requesting prompt human intervention when cloud anomalies occur. The quality of alerts significantly affects cloud reliability and the cloud provider's business revenue. In practice, we observe that on-call engineers are hindered from quickly locating and fixing faulty cloud services by the prevalence of misleading, non-informative, and non-actionable alerts. We refer to these forms of alert ineffectiveness as "anti-patterns of alerts". To better understand these anti-patterns and provide actionable measures to mitigate them, in this paper we conduct the first empirical study on the practices of mitigating anti-patterns of alerts in an industrial cloud system. We study the alert strategies and the alert processing procedure at Huawei Cloud, a leading cloud provider. Our study combines a quantitative analysis of millions of alerts over two years with a survey of eighteen experienced engineers. As a result, we summarize four individual anti-patterns and two collective anti-patterns of alerts. We also summarize four current reactions to mitigate the anti-patterns of alerts, along with general preventative guidelines for the configuration of alert strategies. Lastly, we propose to explore the automatic evaluation of the Quality of Alerts (QoA), including the indicativeness, precision, and handleability of alerts, as a future research direction that assists in the automatic detection of alerts' anti-patterns. The findings of our study are valuable for optimizing cloud monitoring systems and improving the reliability of cloud services.