Artificial Intelligence for IT Operations (AIOps) describes the process of maintaining and operating large IT systems with diverse AI-enabled methods and tools, e.g., for anomaly detection and root cause analysis, to support remediation, optimization, and the automatic initiation of self-stabilizing IT activities. The core step of any AIOps workflow is anomaly detection, typically performed on high-volume heterogeneous data such as log messages (logs), metrics (e.g., CPU utilization), and distributed traces. In this paper, we propose a method for reliable and practical anomaly detection from system logs. It overcomes a common disadvantage of related work, namely the need for a large amount of manually labeled training data, by building an anomaly detection model from log instructions in the source code of 1000+ GitHub projects. These instructions, drawn from diverse systems, contain rich and heterogeneous information about many different normal and abnormal IT events and serve as a foundation for anomaly detection. The proposed method, named ADLILog, combines the log instructions with data from the system of interest (the target system) to learn a deep neural network model through a two-phase learning procedure. The experimental results show that ADLILog outperforms related approaches by up to 60% on the F1 score while satisfying core non-functional requirements for industrial deployment, such as unsupervised design, efficient model updates, and small model sizes.
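For readers unfamiliar with the evaluation metric, the F1 score mentioned above is the harmonic mean of precision and recall on the anomalous class. The following is a minimal sketch of that standard definition only; the function name `f1_score`, the 0/1 label encoding, and the toy labels are assumptions made for illustration and are not taken from the paper or its evaluation data.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class; assumed encoding: 1 = anomaly, 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # anomalies correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal logs flagged as anomalous
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # anomalies that were missed
    if tp == 0:
        # No true positives: define F1 as 0 to avoid division by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, not results from the paper.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
print(f1_score(y_true, y_pred))
```

Because F1 ignores true negatives, it is a common choice for anomaly detection, where normal log messages typically far outnumber anomalous ones.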