Fully supervised log anomaly detection methods bear the heavy burden of annotating massive amounts of unlabeled log data. Recently, many semi-supervised methods have been proposed to reduce annotation costs with the help of parsed templates. However, these methods consider each keyword independently, disregarding both the correlations between keywords and the contextual relationships among log sequences. In this paper, we propose a novel weakly supervised log anomaly detection framework, named LogLG, which explores the semantic connections among keywords extracted from log sequences. Specifically, we design an end-to-end iterative process in which the keywords of unlabeled logs are first extracted to construct a log-event graph. Then, we build a subgraph annotator to generate pseudo labels for unlabeled log sequences. To improve annotation quality, we pre-train the subgraph annotator with a self-supervised task. After that, a detection model is trained with the generated pseudo labels. Conditioned on the classification results, we re-extract the keywords from the log sequences and update the log-event graph for the next iteration. Experiments on five benchmarks validate the effectiveness of LogLG for detecting anomalies in unlabeled log data and demonstrate that LogLG, as a state-of-the-art weakly supervised method, achieves significant performance improvements over existing methods.
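To make the iterative process concrete, the following is a minimal Python sketch of the loop described above, not the LogLG implementation. It rests on several assumptions that do not come from the paper: the log-event graph is approximated by a keyword co-occurrence graph, the subgraph annotator is replaced by a simple score over keyword hits and the edges between them (with no self-supervised pre-training), and the detection model is a toy token-frequency classifier. All function names, the seed keywords, and the sample sequences are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_graph(sequences, keywords):
    """Co-occurrence graph: an edge weight counts sequences containing both keywords."""
    graph = defaultdict(Counter)
    for seq in sequences:
        present = sorted(set(seq) & keywords)
        for a, b in combinations(present, 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def annotate(seq, graph, class_keywords):
    """Pseudo-label a sequence: keyword hits plus edge weights among the hits (assumed scoring)."""
    tokens = set(seq)
    scores = {}
    for label, kws in class_keywords.items():
        hits = sorted(tokens & kws)
        scores[label] = len(hits) + sum(graph[a][b] for a, b in combinations(hits, 2))
    return max(scores, key=scores.get)


def train_detector(sequences, labels):
    """Toy stand-in for the detection model: per-class token counts."""
    counts = defaultdict(Counter)
    for seq, label in zip(sequences, labels):
        counts[label].update(seq)
    return counts


def predict(seq, counts):
    """Pick the class whose relative token frequencies best cover the sequence."""
    def score(label):
        total = sum(counts[label].values()) or 1
        return sum(counts[label][tok] / total for tok in seq)
    return max(counts, key=score)


def extract_keywords(sequences, labels, top_k=3):
    """Re-extract keywords: tokens seen more often in one class than in all others."""
    counts = train_detector(sequences, labels)
    keywords = {}
    for label in counts:
        others = Counter()
        for other, c in counts.items():
            if other != label:
                others.update(c)
        margin = {tok: n - others[tok] for tok, n in counts[label].items()}
        ranked = sorted(margin, key=lambda t: (-margin[t], t))
        keywords[label] = {t for t in ranked[:top_k] if margin[t] > 0}
    return keywords


def run(sequences, seed_keywords, iterations=3):
    """Iterate: build graph -> pseudo-label -> train detector -> re-extract keywords."""
    class_keywords = {k: set(v) for k, v in seed_keywords.items()}
    predictions = []
    for _ in range(iterations):
        all_keywords = set().union(*class_keywords.values())
        graph = build_graph(sequences, all_keywords)
        pseudo = [annotate(s, graph, class_keywords) for s in sequences]
        detector = train_detector(sequences, pseudo)
        predictions = [predict(s, detector) for s in sequences]
        class_keywords = extract_keywords(sequences, predictions)
    return predictions, class_keywords


# Hypothetical parsed log sequences (event tokens) and seed keywords.
sequences = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "error", "timeout"],
    ["open", "write", "close"],
    ["open", "error", "retry", "timeout"],
]
seed = {"normal": {"close"}, "anomaly": {"error"}}
predictions, keywords = run(sequences, seed)
```

In this sketch the keyword sets grow from the single seed word per class to co-occurring tokens after the first iteration, which is the role the re-extraction step plays in the process described above. A faithful implementation would replace `annotate` with a learned subgraph annotator and `predict` with a trained sequence classifier.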