The momentum gained by microservices and cloud-native software architectures has pushed today's enterprise IT towards multi-service applications. The proliferation of services and service interactions within such applications, which often consist of hundreds of interacting services, makes it harder to detect failures and to identify their possible root causes, both of which are crucial for promptly recovering and fixing applications. Various techniques have been proposed to promptly detect failures based on their symptoms, viz., by observing anomalous behaviour in one or more application services, as well as to analyse the logs or monitored performance of such services to determine the possible root causes of observed anomalies. The objective of this survey is to provide a structured overview and a qualitative analysis of currently available techniques for anomaly detection and root cause analysis in modern multi-service applications. Some open challenges and research directions stemming from the analysis are also discussed.