Prompt and accurate detection of system anomalies is essential to ensure the reliability of software systems. Unlike manual efforts that exploit all available run-time information, existing approaches usually leverage only a single type of monitoring data (often logs or metrics) or fail to make effective use of the joint information among different types of data. Consequently, many false predictions occur. To better understand the manifestations of system anomalies, we conduct a systematic study on a large amount of heterogeneous data, i.e., logs and metrics. Our study demonstrates that logs and metrics manifest system anomalies collaboratively and complementarily, and that neither alone is sufficient. Thus, integrating heterogeneous data can help recover the complete picture of a system's health status. In this context, we propose Hades, the first end-to-end semi-supervised approach to effectively identify system anomalies based on heterogeneous data. Our approach employs a hierarchical architecture to learn a global representation of the system status by fusing log semantics and metric patterns. It captures discriminative features and meaningful interactions from heterogeneous data via a cross-modal attention module, trained in a semi-supervised manner. We evaluate Hades extensively on large-scale simulated data and datasets from Huawei Cloud. The experimental results demonstrate the effectiveness of our model in detecting system anomalies. We also release the code and the annotated dataset for replication and future research.
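To make the idea of fusing log semantics and metric patterns through cross-modal attention more concrete, the following is a minimal sketch of generic scaled dot-product cross-attention between two modalities. It is not the Hades implementation: the function names (`cross_attention`, `fuse`), the embedding sizes, the mean pooling, and the random inputs are all assumptions made for illustration, and a real model would add learned projections and be trained end to end.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Each query vector attends over the vectors of the other modality.

    Returns the attended context vectors and the attention weights.
    Learned query/key/value projections are omitted (assumption for brevity).
    """
    d = len(queries[0])
    attended, all_weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        # Weighted sum of the other modality's vectors.
        context = [
            sum(w * kv[j] for w, kv in zip(weights, keys_values))
            for j in range(d)
        ]
        attended.append(context)
        all_weights.append(weights)
    return attended, all_weights


def mean_pool(rows):
    """Average a list of equal-length vectors into one vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fuse(log_emb, metric_emb):
    """Hypothetical fusion: logs attend to metrics and vice versa,
    then both pooled contexts are concatenated into one global vector."""
    log_ctx, _ = cross_attention(log_emb, metric_emb)
    metric_ctx, _ = cross_attention(metric_emb, log_emb)
    return mean_pool(log_ctx) + mean_pool(metric_ctx)


# Toy inputs: 5 log-event embeddings and 3 metric-window embeddings, dim 8.
random.seed(0)
log_emb = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
metric_emb = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]

fused = fuse(log_emb, metric_emb)
print(len(fused))  # 16
```

In this sketch the fused vector would be passed to a downstream anomaly classifier; the attention weights indicate which metric windows each log event relies on, which is one way such a module can expose interactions between the two data types.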