Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic losses. In the field of online service systems, operators rely on enormous amounts of monitoring data to detect and mitigate failures. Quickly recognizing a small set of root cause indicators for the underlying fault can save considerable time in failure mitigation. In this paper, we formulate the root cause analysis problem as a new causal inference task named intervention recognition. We propose a novel unsupervised causal inference-based method named Causal Inference-based Root Cause Analysis (CIRCA). The core idea is a sufficient condition for a monitoring variable to be a root cause indicator, i.e., a change in its probability distribution conditioned on its parents in the Causal Bayesian Network (CBN). For application in online service systems, CIRCA constructs a graph among monitoring metrics based on knowledge of the system architecture and a set of causal assumptions. A simulation study illustrates the theoretical reliability of CIRCA. The performance on a real-world dataset further shows that CIRCA can improve the recall of the top-1 recommendation by 25% over the best baseline method.
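To illustrate the core idea, the following minimal Python sketch scores each monitoring variable by how far its fault-period values deviate from what its parents predict under a model fitted on fault-free data. It is not the CIRCA implementation: the linear-Gaussian model, the helper names `fit_linear` and `score_variables`, the three-metric chain A → B → C, and the synthetic data are all assumptions made for illustration only.

```python
import numpy as np


def fit_linear(parents, target):
    """Least-squares fit of target on its parents plus an intercept (assumed model)."""
    X = np.column_stack([parents, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    return coef, resid.std(ddof=1)


def score_variables(graph, normal, faulty):
    """Score each node by its largest standardized residual during the fault.

    graph  : dict mapping node -> list of parent nodes (the assumed CBN structure)
    normal : dict mapping node -> 1-D array of fault-free observations
    faulty : dict mapping node -> 1-D array of observations during the fault
    """
    scores = {}
    for node, parents in graph.items():
        n_norm, n_fault = len(normal[node]), len(faulty[node])
        if parents:
            p_norm = np.column_stack([normal[p] for p in parents])
            p_fault = np.column_stack([faulty[p] for p in parents])
        else:
            # A node without parents is modeled by its fault-free mean only.
            p_norm = np.empty((n_norm, 0))
            p_fault = np.empty((n_fault, 0))
        coef, sigma = fit_linear(p_norm, normal[node])
        x_fault = np.column_stack([p_fault, np.ones(n_fault)])
        z = (faulty[node] - x_fault @ coef) / max(sigma, 1e-12)
        scores[node] = float(np.max(np.abs(z)))
    return scores


# Hypothetical synthetic data: the fault intervenes on B only.
rng = np.random.default_rng(0)


def simulate(n, shift_b=0.0):
    a = rng.normal(size=n)
    b = 2.0 * a + shift_b + 0.1 * rng.normal(size=n)
    c = b + 0.1 * rng.normal(size=n)  # C keeps following its parent B
    return {"A": a, "B": b, "C": c}


graph = {"A": [], "B": ["A"], "C": ["B"]}
normal = simulate(500)
faulty = simulate(50, shift_b=5.0)

scores = score_variables(graph, normal, faulty)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy setting, C also takes abnormal values during the fault, but it is still explained by its parent B. Only B's distribution conditioned on its parents changes, so conditioning on parents is what separates the intervened variable from its affected descendants.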