Incident management for cloud services is a complex, multi-step process that has a major impact on both service health and developer productivity. On-call engineers need a significant amount of domain knowledge and manual effort to root-cause and mitigate production incidents. Recent advances in artificial intelligence have resulted in state-of-the-art large language models such as GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we conduct the first large-scale study to evaluate the effectiveness of these models in helping engineers root-cause and mitigate production incidents. We perform a rigorous study at Microsoft on more than 40,000 incidents and compare several large language models in zero-shot, fine-tuned, and multi-task settings using semantic and lexical metrics. Lastly, our human evaluation with actual incident owners shows the efficacy and future potential of using artificial intelligence for resolving cloud incidents.