Despite engineering workflows that aim to prevent buggy code from being deployed, bugs still make their way into the Facebook app. When symptoms of these bugs, such as user-submitted reports and automatically captured crashes, are reported, finding their root causes is an important step in resolving them. However, at Facebook's scale of billions of users, a single bug can manifest as several different symptoms depending on the user and execution environments in which the software is deployed. Root cause analysis (RCA) therefore requires tedious manual investigation and domain expertise to extract the common patterns observed in groups of reports and use them for debugging. We propose Minesweeper, a technique for RCA that moves towards automatically identifying the root cause of bugs from their symptoms. The method is based on two key aspects: (i) a scalable algorithm to efficiently mine patterns from telemetric information that is collected along with the reports, and (ii) statistical notions of precision and recall of patterns that help point towards root causes. We evaluate Minesweeper's scalability and effectiveness in finding root causes from symptoms on real-world bug and crash reports from Facebook's apps. Our evaluation demonstrates that Minesweeper can perform RCA for tens of thousands of reports in less than 3 minutes, and is more than 85% accurate in identifying the root cause of regressions.
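To make the roles of precision and recall concrete, the following is a minimal sketch and not Minesweeper's actual algorithm. It assumes that each report is a set of telemetry events, that reports are split into a test group (those showing the symptom) and a control group, that the precision of a pattern is the share of matching reports that belong to the test group, and that its recall is the share of test-group reports it matches. The function names, the brute-force enumeration of small patterns, the F1-based ranking, and the toy data are all hypothetical choices made for illustration; the scalable mining step described above is not reproduced here.

```python
from itertools import combinations


def pattern_stats(pattern, test_reports, control_reports):
    """Return (precision, recall) of `pattern`, a frozenset of events.

    Assumed definitions (not taken from the abstract):
      precision = matching test reports / all matching reports
      recall    = matching test reports / all test reports
    """
    test_hits = sum(1 for report in test_reports if pattern <= report)
    control_hits = sum(1 for report in control_reports if pattern <= report)
    total_hits = test_hits + control_hits
    precision = test_hits / total_hits if total_hits else 0.0
    recall = test_hits / len(test_reports) if test_reports else 0.0
    return precision, recall


def rank_patterns(test_reports, control_reports, max_size=2):
    """Brute-force all patterns of up to `max_size` events seen in the test
    group and rank them by the F1 score of precision and recall.

    Exhaustive enumeration is only viable for toy inputs.
    """
    events = sorted(set().union(*test_reports))
    scored = []
    for size in range(1, max_size + 1):
        for combo in combinations(events, size):
            pattern = frozenset(combo)
            precision, recall = pattern_stats(
                pattern, test_reports, control_reports
            )
            if recall > 0:  # keep only patterns that occur in the test group
                scored.append((pattern, precision, recall))

    def sort_key(item):
        pattern, precision, recall = item
        f1 = 2 * precision * recall / (precision + recall)
        # Highest F1 first; ties go to the smaller pattern, then by name.
        return (-f1, len(pattern), sorted(pattern))

    return sorted(scored, key=sort_key)


# Made-up telemetry: every symptomatic report contains "tap:share".
TEST_REPORTS = [
    {"os=android_11", "screen=feed", "tap:share"},
    {"os=android_11", "screen=profile", "tap:share"},
    {"os=ios_15", "screen=feed", "tap:share"},
]
CONTROL_REPORTS = [
    {"os=android_11", "screen=feed", "tap:like"},
    {"os=ios_15", "screen=feed", "tap:like"},
    {"os=ios_15", "screen=profile", "tap:like"},
]

if __name__ == "__main__":
    for pattern, precision, recall in rank_patterns(TEST_REPORTS, CONTROL_REPORTS)[:3]:
        print(sorted(pattern), round(precision, 2), round(recall, 2))
```

In this toy data, a pattern that appears in every symptomatic report and in no control report scores highest on both measures, which is the sense in which high-precision, high-recall patterns point towards a likely root cause.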