Root Cause Analysis (RCA) of a service-disrupting incident is one of the most critical and complex tasks in IT processes, especially for cloud industry leaders like Salesforce. RCA investigations typically rely on data sources such as application error logs or service call traces. However, a rich source of root cause information is also hidden in the natural language documentation of past incident investigations written by domain experts. This documentation is generally termed Problem Review Board (PRB) data, and it constitutes a core component of IT Incident Management. Because PRBs are raw and unstructured, the root cause knowledge they contain is not directly reusable by manual or automated pipelines for RCA of new incidents. This motivates us to leverage this widely available data source to build an Incident Causation Analysis (ICA) engine, which uses state-of-the-art neural NLP techniques to extract targeted information and construct a structured Causal Knowledge Graph from PRB documents. ICA forms the backbone of a simple yet effective retrieval-based RCA for new incidents: given an incident symptom, an Information Retrieval system searches and ranks past incidents and detects likely root causes from them. In this work, we present ICA and the downstream Incident Search and Retrieval-based RCA pipeline, built at Salesforce over 2K documented cloud service incident investigations collected over a few years. We also establish the effectiveness of ICA and the downstream tasks through quantitative benchmarks, qualitative analysis, domain experts' validation, and real incident case studies after deployment.
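The retrieval-based RCA idea can be illustrated with a minimal sketch. This is not the ICA implementation: the pipeline described above uses neural NLP and a causal knowledge graph, whereas the sketch below substitutes plain bag-of-words cosine similarity. The incident records, field names, and the `rank_root_causes` helper are all hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy records (assumption): stand-ins for structured
# symptom/root-cause pairs extracted from past PRB documents.
PAST_INCIDENTS = [
    {"symptom": "database connection pool exhausted causing request timeouts",
     "root_cause": "connection leak after deployment"},
    {"symptom": "request timeouts and high latency on login service",
     "root_cause": "connection leak after deployment"},
    {"symptom": "disk full on log server, writes failing",
     "root_cause": "log rotation misconfigured"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_root_causes(query, incidents, top_k=2):
    """Retrieve the top_k most similar past incidents for a symptom query,
    then rank their root causes by summed similarity score."""
    q = Counter(tokenize(query))
    scored = sorted(
        ((cosine(q, Counter(tokenize(inc["symptom"]))), inc) for inc in incidents),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]
    votes = defaultdict(float)
    for score, inc in scored:
        votes[inc["root_cause"]] += score
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for cause, score in rank_root_causes("request timeouts on login", PAST_INCIDENTS):
        print(f"{score:.3f}  {cause}")
```

Summing similarity over the retrieved incidents lets a root cause that recurs across several similar past incidents outrank one that appears only once; this aggregation rule is an assumption of the sketch, not a detail taken from the paper.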