Fully supervised log anomaly detection methods suffer from the heavy burden of annotating massive unlabeled log data. Recently, many semi-supervised methods have been proposed to reduce annotation costs with the help of parsed templates. However, these methods consider each keyword independently, disregarding both the correlations between keywords and the contextual relationships among log sequences. In this paper, we propose a novel weakly supervised log anomaly detection framework, named LogLG, to explore the semantic connections among keywords across log sequences. Specifically, we design an end-to-end iterative process, in which the keywords of unlabeled logs are first extracted to construct a log-event graph. Then, we build a subgraph annotator to generate pseudo labels for unlabeled log sequences. To improve the annotation quality, we adopt a self-supervised task to pre-train the subgraph annotator. After that, a detection model is trained with the generated pseudo labels. Guided by the classification results, we re-extract the keywords from the log sequences and update the log-event graph for the next iteration. Experiments on five benchmarks validate the effectiveness of LogLG for detecting anomalies in unlabeled log data and demonstrate that LogLG achieves state-of-the-art performance among weakly supervised methods, with significant improvements over existing approaches.
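To make the iterative process concrete, the following is a minimal Python sketch of the control flow described above. Every name in it (extract_keywords, build_log_event_graph, SubgraphAnnotator, DetectionModel) is a hypothetical placeholder standing in for the actual LogLG components, not the authors' implementation; it only illustrates how keyword extraction, log-event graph construction, pseudo-labeling, detector training, and keyword re-extraction chain together across iterations.

```python
from typing import Dict, List, Set


def extract_keywords(sequences: List[List[str]], preds: List[int] = None) -> List[List[str]]:
    """Placeholder: pick salient tokens from each log sequence.
    In LogLG, re-extraction is conditioned on the previous classification
    results (preds); this toy version ignores them."""
    return [[tok for tok in seq if tok.isalpha()] for seq in sequences]


def build_log_event_graph(keywords: List[List[str]]) -> Dict[str, Set[str]]:
    """Placeholder: connect keywords that co-occur within a sequence."""
    graph: Dict[str, Set[str]] = {}
    for seq in keywords:
        for a in seq:
            graph.setdefault(a, set()).update(b for b in seq if b != a)
    return graph


class SubgraphAnnotator:
    """Placeholder annotator; the paper pre-trains it with a self-supervised task."""

    def pretrain(self, graph: Dict[str, Set[str]]) -> None:
        pass  # self-supervised pre-training over the log-event graph

    def pseudo_label(self, graph: Dict[str, Set[str]], keywords: List[List[str]]) -> List[int]:
        # Toy heuristic: flag sequences whose keyword subgraph is weakly connected.
        return [int(sum(len(graph.get(k, ())) for k in seq) < 3) for seq in keywords]


class DetectionModel:
    """Placeholder anomaly detector trained on pseudo labels."""

    def fit(self, sequences: List[List[str]], labels: List[int]) -> None:
        pass  # supervised training on the pseudo-labeled sequences

    def predict(self, sequences: List[List[str]]) -> List[int]:
        return [0 for _ in sequences]


def loglg_style_iteration(sequences: List[List[str]], rounds: int = 3) -> List[int]:
    keywords = extract_keywords(sequences)
    preds: List[int] = []
    for _ in range(rounds):
        graph = build_log_event_graph(keywords)           # build log-event graph
        annotator = SubgraphAnnotator()
        annotator.pretrain(graph)                         # self-supervised pre-training
        pseudo = annotator.pseudo_label(graph, keywords)  # pseudo labels for sequences
        detector = DetectionModel()
        detector.fit(sequences, pseudo)                   # train detector on pseudo labels
        preds = detector.predict(sequences)
        keywords = extract_keywords(sequences, preds)     # re-extract keywords for next round
    return preds
```

The sketch deliberately leaves the annotator and detector as stubs, since the abstract does not specify their architectures; only the end-to-end loop structure is taken from the text above.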