Artificial Intelligence for IT Operations (AIOps) aims to combine the power of AI with the big data generated by IT Operations processes, particularly in cloud infrastructures, to provide actionable insights with the primary goal of maximizing availability. There is a wide variety of problems to address, and multiple use cases, where AI capabilities can be leveraged to enhance operational efficiency. Here we provide a review of the AIOps vision, trends, challenges, and opportunities, focusing specifically on the underlying AI techniques. We discuss in depth the key types of data emitted by IT Operations activities, the scale and challenges involved in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as incident detection, failure prediction, root cause analysis, and automated actions. We discuss the problem formulation for each task and then present a taxonomy of techniques for solving these problems. We also identify relatively underexplored topics, especially those that could benefit significantly from advances in the AI literature. Finally, we provide insights into the trends in this field and the key investment opportunities.