For large-scale distributed systems, it is crucial to diagnose the root causes of incidents efficiently in order to maintain high system availability. The recent adoption of microservice architecture brings three major challenges (i.e., operation, system scale, and monitoring complexities) to root cause analysis (RCA) in industrial settings. To tackle these challenges, in this paper, we present Groot, an event-graph-based approach for RCA. Groot constructs a real-time causality graph based on events that summarize various types of metrics, logs, and activities in the system under analysis. Moreover, to incorporate domain knowledge from site reliability engineers (SREs), Groot can be customized with user-defined events and domain-specific rules. Currently, Groot supports RCA among 5,000 real production services and is actively used by the SRE team in a global e-commerce system serving more than 185 million active buyers per year. Over 15 months, we collect a data set containing labeled root causes of 952 real production incidents for evaluation. The evaluation results show that Groot is able to achieve 95% top-3 accuracy and 78% top-1 accuracy. To share our experience in deploying and adopting RCA in industrial settings, we conduct a survey showing that users of Groot find it helpful and easy to use. We also share the lessons learned from deploying and adopting Groot to solve RCA problems in production environments.