The complexity and dynamism of microservices pose significant challenges to system reliability, making automated troubleshooting crucial. Effective root cause localization after anomaly detection is essential for ensuring the reliability of microservice systems. However, two significant issues remain in existing approaches: (1) Microservices generate traces, system logs, and key performance indicators (KPIs), but existing approaches usually consider only traces and thus fail to understand the system fully, as traces cannot depict all anomalies; (2) Troubleshooting microservices generally comprises two main phases, i.e., anomaly detection and root cause localization. Existing studies treat these two phases as independent, ignoring their close correlation. Even worse, inaccurate detection results can severely degrade localization effectiveness. To overcome these limitations, we propose Eadro, the first end-to-end framework to integrate anomaly detection and root cause localization based on multi-source data for troubleshooting large-scale microservices. Eadro rests on two key insights: anomalies manifest on different data sources, and detection and localization are closely connected. Accordingly, Eadro models intra-service behaviors and inter-service dependencies from traces, logs, and KPIs, while leveraging the knowledge shared between the two phases via multi-task learning. Experiments on two widely used benchmark microservices demonstrate that Eadro outperforms state-of-the-art approaches by a large margin. The results also show the usefulness of integrating multi-source data. We also release our code and data to facilitate future research.
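The idea of sharing knowledge between detection and localization via multi-task learning can be illustrated with a joint objective. The sketch below is not Eadro's implementation: the function names, the weighting parameter `beta`, and the choice of cross-entropy losses are all assumptions made for illustration. It only shows how one training signal might combine a system-level "is there an anomaly?" loss with a "which service is the culprit?" loss.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw detection score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Turn per-service scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_loss(anomaly_logit, service_scores, is_anomalous, culprit_index, beta=0.5):
    """Weighted sum of a detection loss and a localization loss.

    `beta` is a hypothetical trade-off weight, not a value from the paper.
    The localization term is only counted for anomalous samples, since a
    normal sample has no root cause to point at.
    """
    eps = 1e-12  # guards against log(0)
    p = sigmoid(anomaly_logit)
    # Binary cross-entropy for the detection task.
    detect = -(math.log(p + eps) if is_anomalous else math.log(1.0 - p + eps))
    # Categorical cross-entropy for the localization task.
    if is_anomalous:
        probs = softmax(service_scores)
        localize = -math.log(probs[culprit_index] + eps)
    else:
        localize = 0.0
    return beta * detect + (1.0 - beta) * localize
```

In a real model both outputs would come from a shared representation of traces, logs, and KPIs, so gradients from either task update the same parameters; that sharing is what the multi-task formulation is meant to exploit.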