As Alibaba's business expands worldwide across various industries, higher standards are imposed on the service quality and reliability of the big data cloud computing platforms that constitute the infrastructure of Alibaba Cloud. However, root cause analysis in these platforms is non-trivial due to the complicated system architecture. In this paper, we propose a root cause analysis framework called CloudRCA, which makes use of heterogeneous multi-source data, including Key Performance Indicators (KPIs), logs, and topology, and extracts important features via state-of-the-art anomaly detection and log analysis techniques. The engineered features are then used in a Knowledge-informed Hierarchical Bayesian Network (KHBN) model to infer root causes with high accuracy and efficiency. An ablation study and comprehensive experimental comparisons demonstrate that, compared to existing frameworks, CloudRCA 1) consistently outperforms existing approaches in F1-score across different cloud systems; 2) can handle novel types of root causes thanks to the hierarchical structure of KHBN; 3) performs more robustly with respect to algorithmic configurations; and 4) scales more favorably with data and feature sizes. Experiments also show that a cross-platform transfer learning mechanism can be adopted to further improve the accuracy by more than 10%. CloudRCA has been integrated into the diagnosis system of Alibaba Cloud and employed in three typical cloud computing platforms: MaxCompute, Realtime Compute, and Hologres. Over the past twelve months, it has reduced the time Site Reliability Engineers (SREs) spend on resolving failures by more than 20% and has significantly improved service reliability.
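To make the idea of inferring root causes from engineered features concrete, the following is a minimal toy sketch, not the KHBN model itself. It assumes a hypothetical two-level hierarchy (cause type, then concrete root cause), binary features such as KPI anomaly flags and log pattern flags that are conditionally independent given the root cause, and made-up probabilities. All names and numbers are illustrative assumptions rather than values from the paper.

```python
from math import prod

# Hypothetical two-level hierarchy: cause type -> concrete root cause.
# Every probability below is invented for illustration only.
PRIOR_TYPE = {"resource": 0.5, "software": 0.5}
PRIOR_CAUSE_GIVEN_TYPE = {
    "resource": {"disk_full": 0.5, "memory_exhausted": 0.5},
    "software": {"config_error": 1.0},
}

# P(feature is flagged | root cause) for binary features derived from
# KPI anomaly detection ("kpi_*") and log analysis ("log_*").
LIKELIHOOD = {
    "disk_full": {"kpi_disk_usage": 0.9, "kpi_memory": 0.1, "log_io_error": 0.8},
    "memory_exhausted": {"kpi_disk_usage": 0.1, "kpi_memory": 0.9, "log_io_error": 0.1},
    "config_error": {"kpi_disk_usage": 0.1, "kpi_memory": 0.1, "log_io_error": 0.2},
}


def posterior(observed):
    """Return P(root cause | observed binary features) by exact enumeration."""
    scores = {}
    for cause_type, p_type in PRIOR_TYPE.items():
        for cause, p_cause in PRIOR_CAUSE_GIVEN_TYPE[cause_type].items():
            # Features are assumed conditionally independent given the cause.
            likelihood = prod(
                LIKELIHOOD[cause][name] if flagged else 1.0 - LIKELIHOOD[cause][name]
                for name, flagged in observed.items()
            )
            scores[cause] = p_type * p_cause * likelihood
    total = sum(scores.values())
    return {cause: score / total for cause, score in scores.items()}


def rank_root_causes(observed):
    """Sort candidate root causes from most to least probable."""
    return sorted(posterior(observed).items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    incident = {"kpi_disk_usage": True, "kpi_memory": False, "log_io_error": True}
    for cause, probability in rank_root_causes(incident):
        print(f"{cause}: {probability:.3f}")
```

In this toy setting, the hierarchy only shapes the prior over root causes. A model like the one described above would additionally encode expert knowledge and topology, which this sketch does not attempt.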