AutoTSG: Learning and Synthesis for Incident Troubleshooting

Incident management is a key aspect of operating large-scale cloud services. To enable faster and more efficient incident resolution, engineering teams document frequent troubleshooting steps in Troubleshooting Guides (TSGs) for use by on-call engineers (OCEs). However, TSGs are siloed, unstructured, and often incomplete, requiring developers to manually understand and execute the necessary steps. This leads to on-call fatigue, reduced productivity, and human error. In this work, we conduct a large-scale empirical study of more than 4,000 TSGs mapped to thousands of incidents and find that TSGs are widely used and significantly reduce mitigation effort. We then analyze feedback on TSGs provided by more than 400 OCEs and propose a taxonomy of issues that highlights significant gaps in TSG quality. To close these gaps, we investigate the automation of TSGs and propose AutoTSG, a novel framework that converts TSGs into executable workflows by combining machine learning and program synthesis. Our evaluation of AutoTSG on 50 TSGs shows that it is effective at both identifying TSG statements (accuracy 0.89) and parsing them for execution (precision 0.94, recall 0.91). Lastly, we survey ten Microsoft engineers and show the importance of TSG automation and the usefulness of AutoTSG.
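To make the two stages named in the abstract concrete (identifying TSG statements, then parsing them for execution), the following is a minimal illustrative sketch. It is not AutoTSG's implementation: a hand-written regular expression stands in for the learned classifier, and simple shell tokenization stands in for program synthesis. All names here (`Step`, `COMMAND_PATTERN`, `identify`, `parse`, `to_workflow`) are hypothetical.

```python
import re
import shlex
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str        # "command" or "text"
    raw: str         # original TSG line
    argv: List[str]  # tokenized command; empty for plain text


# Hypothetical heuristic standing in for a learned statement classifier:
# a line counts as a command if it starts with a prompt or a known tool name.
COMMAND_PATTERN = re.compile(r"^\s*(\$|>|kubectl\b|curl\b|ping\b|Get-\w+)")


def identify(line: str) -> str:
    """Stage 1: label a TSG line as an executable command or free text."""
    return "command" if COMMAND_PATTERN.match(line) else "text"


def parse(line: str) -> Step:
    """Stage 2: turn a command line into a structured, executable step."""
    kind = identify(line)
    if kind == "text":
        return Step(kind, line, [])
    # Drop a leading shell prompt ("$" or ">") before tokenizing.
    body = re.sub(r"^\s*[$>]\s*", "", line)
    return Step(kind, line, shlex.split(body))


def to_workflow(tsg_text: str) -> List[Step]:
    """Convert a plain-text TSG into an ordered list of steps."""
    return [parse(line) for line in tsg_text.splitlines() if line.strip()]
```

Calling `to_workflow` on a short guide that mixes an instruction sentence with a `$ ping ...` line yields one `text` step followed by a `command` step whose `argv` could be handed to a runner. A real system would need a trained classifier and a far richer parser than this heuristic.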

