A major difficulty in debugging distributed systems lies in manually determining which of the many available debugging tools to use and how to query its logs. Our own study of a production debugging workflow confirms the magnitude of this burden. This paper explores whether a machine-learning model can assist developers in distributed systems debugging. We present Revelio, a debugging assistant which takes user reports and system logs as input, and outputs debugging queries that developers can use to find a bug's root cause. The key challenges lie in (1) combining inputs of different types (e.g., natural language reports and quantitative logs) and (2) generalizing to unseen faults. Revelio addresses these by employing deep neural networks to uniformly embed diverse input sources and potential queries into a high-dimensional vector space. In addition, it exploits observations from production systems to factorize query generation into two computationally and statistically simpler learning tasks. To evaluate Revelio, we built a testbed with multiple distributed applications and debugging tools. By injecting faults and training on logs and reports from 800 Mechanical Turkers, we show that Revelio includes the most helpful query in its predicted list of top-3 relevant queries 96% of the time. Our developer study confirms the utility of Revelio.
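As a rough illustration of ranking candidate queries in a shared vector space and scoring a top-k result, consider the minimal sketch below. It is not Revelio's implementation: the abstract does not specify the similarity measure, the network architecture, or the two factorized learning tasks. The sketch assumes cosine similarity as a stand-in, and the query names, the vectors, and the helpers `top_k_queries` and `top_k_hit_rate` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_queries(input_vec, query_vecs, k=3):
    """Return the names of the k candidate queries closest to the input embedding."""
    ranked = sorted(
        query_vecs,
        key=lambda name: cosine(input_vec, query_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


def top_k_hit_rate(cases, query_vecs, k=3):
    """Fraction of cases whose labeled most-helpful query appears in the top-k list."""
    hits = sum(
        1 for vec, best in cases if best in top_k_queries(vec, query_vecs, k)
    )
    return hits / len(cases)


# Hypothetical toy embeddings; in a real system these would come from learned
# encoders over user reports, system logs, and candidate debugging queries.
QUERY_VECS = {
    "tool_a:latency_by_service": [0.9, 0.1, 0.0],
    "tool_a:error_count": [0.1, 0.9, 0.1],
    "tool_b:packet_drops": [0.0, 0.2, 0.9],
    "tool_b:cpu_usage": [0.5, 0.5, 0.5],
}

if __name__ == "__main__":
    embedded_input = [1.0, 0.0, 0.0]  # stand-in for an embedded report plus logs
    print(top_k_queries(embedded_input, QUERY_VECS, k=3))
```

Under these assumptions, the top-3 figure reported above corresponds to `top_k_hit_rate` with `k=3`, computed over held-out cases labeled with their most helpful query.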