Prompt and accurate detection of system anomalies is essential to ensuring the reliability of software systems. Unlike manual efforts, which exploit all available run-time information, existing approaches usually rely on a single type of monitoring data (often logs or metrics) or fail to make effective use of the joint information in multi-source data. Consequently, they produce many false predictions. To better understand how system anomalies manifest, we conduct a comprehensive empirical study on a large amount of heterogeneous data, i.e., logs and metrics. Our study shows that system anomalies can manifest differently in different data types. Thus, integrating heterogeneous data helps recover the complete picture of a system's health status. In this context, we propose HADES, the first approach to effectively identify system anomalies based on heterogeneous data. HADES employs a hierarchical architecture to learn a global representation of the system status by fusing log semantics and metric patterns. It captures discriminative features and meaningful interactions from multi-modal data via a novel cross-modal attention module, enabling accurate system anomaly detection. We evaluate HADES extensively on large-scale simulated and industrial datasets. The experimental results demonstrate the superiority of HADES in detecting system anomalies on heterogeneous data. We release the code and the annotated dataset for reproducibility and future research.
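To make the idea of cross-modal attention concrete, the following is a minimal sketch and not the HADES implementation. It assumes that log and metric features have already been encoded as equal-width vectors, uses plain scaled dot-product attention with no learned projections, and fuses the two modalities by mean pooling and concatenation. The names `cross_attend`, `mean_pool`, and `fuse`, and the toy feature values, are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Let each query vector (one modality) attend over the other modality.

    Assumption: the same vectors serve as keys and values, and there are
    no learned projection matrices. A real model would learn those.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return attended


def mean_pool(vectors):
    """Average a list of vectors into a single vector."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def fuse(log_feats, metric_feats):
    """Build one joint representation from both modalities (hypothetical)."""
    log_ctx = cross_attend(log_feats, metric_feats)      # logs query metrics
    metric_ctx = cross_attend(metric_feats, log_feats)   # metrics query logs
    return mean_pool(log_ctx) + mean_pool(metric_ctx)    # list concatenation


# Toy inputs: three log-event embeddings and two metric-window embeddings.
log_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
metric_feats = [[0.5, 0.5], [2.0, 0.0]]
fused = fuse(log_feats, metric_feats)
print(len(fused))
```

In this sketch the fused vector would then be passed to a classifier that outputs an anomaly score. Attending in both directions is one simple way to let each modality be reweighted by the other; the paper's module may differ in its projections, number of heads, and fusion step.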