Because modern IT services are complex, failures take many forms, can occur at any stage, and are hard to detect. Anomaly detection applied to monitoring data such as logs therefore yields insights that help to steadily improve IT services and eliminate failures. However, existing anomaly detection methods that achieve high accuracy often rely on labeled training data, which is time-consuming to obtain in practice. We therefore propose PULL, an iterative log analysis method for reactive anomaly detection that relies on estimated failure time windows provided by monitoring systems instead of labeled data. Our attention-based model uses a novel objective function for weakly supervised deep learning that accounts for imbalanced data, and it applies an iterative learning strategy for positive and unknown samples (PU learning) to identify anomalous logs. Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets and detects anomalous log messages with an F1-score of more than 0.99, even within imprecise failure time windows.
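The idea of refining weak labels derived from failure time windows can be illustrated with a toy sketch. This is not the attention-based model or the objective function of PULL. It is a simple frequency heuristic written under stated assumptions: logs inside a window start out as unknown (suspect), logs outside are treated as normal, and the function name `iterative_window_labeling`, the `(timestamp, template)` input format, and the `threshold` parameter are all hypothetical.

```python
from collections import Counter


def iterative_window_labeling(logs, windows, threshold=0.5, max_iters=10):
    """Toy iterative relabeling of logs inside estimated failure windows.

    logs:    list of (timestamp, template) pairs (assumed input format)
    windows: list of (start, end) failure time windows, inclusive
    Returns a list of booleans, True where a log is still considered suspect.
    """
    def in_window(t):
        return any(start <= t <= end for start, end in windows)

    # Initial weak labels: every log inside a failure window is suspect.
    suspect = [in_window(t) for t, _ in logs]
    total = Counter(template for _, template in logs)

    for _ in range(max_iters):
        # Score each template by the share of its occurrences currently suspect.
        hits = Counter(tpl for (_, tpl), s in zip(logs, suspect) if s)
        score = {tpl: hits[tpl] / total[tpl] for tpl in total}
        # A log stays suspect only if its template scores at or above the threshold.
        updated = [s and score[tpl] >= threshold
                   for (_, tpl), s in zip(logs, suspect)]
        if updated == suspect:
            break  # labels are stable, stop iterating
        suspect = updated
    return suspect
```

In this sketch, a template that also appears frequently outside the windows receives a low score and is relabeled as normal, so an imprecise (too wide) window is narrowed to the messages that occur mostly during failures. A learned model would replace the frequency score in practice.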