Incident management is a key aspect of operating large-scale cloud services. To enable faster and more efficient resolution of incidents, engineering teams document frequent troubleshooting steps in the form of Troubleshooting Guides (TSGs), to be used by on-call engineers (OCEs). However, TSGs are siloed, unstructured, and often incomplete, requiring developers to manually understand and execute the necessary steps. This results in issues such as on-call fatigue, reduced productivity, and human error. In this work, we conduct a large-scale empirical study of more than 4,000 TSGs mapped to thousands of incidents and find that TSGs are widely used and help significantly reduce mitigation effort. We then analyze feedback on TSGs provided by more than 400 OCEs and propose a taxonomy of issues that highlights significant gaps in TSG quality. To alleviate these gaps, we investigate the automation of TSGs and propose AutoTSG, a novel framework that converts TSGs into executable workflows by combining machine learning and program synthesis. Our evaluation of AutoTSG on 50 TSGs shows its effectiveness in both identifying TSG statements (accuracy 0.89) and parsing them for execution (precision 0.94 and recall 0.91). Lastly, we survey ten Microsoft engineers and show the importance of TSG automation and the usefulness of AutoTSG.