AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems

Service reliability is one of the key challenges that cloud providers have to deal with. In cloud systems, unplanned service failures may have severe cascading impacts on dependent services, eroding customer satisfaction. Predicting these cascading impacts accurately and efficiently is critical to the operation and maintenance of cloud systems. Existing approaches use distributed tracing to identify whether one service depends on another, but no prior work has focused on quantifying how strong the dependency between cloud services is. In this paper, we survey the outages and the failure diagnosis procedures at two cloud providers to motivate the definition of the intensity of dependency. We define the intensity of dependency between two services as how much the status of the callee service influences the caller service. We then propose AID, the first approach to predict the intensity of dependencies between cloud services. AID first generates a set of candidate dependency pairs from the spans of distributed traces. AID then represents the status of each cloud service with a multivariate time series aggregated from the spans. With this representation of services, AID calculates the similarities between the statuses of the caller and the callee of each candidate pair. Finally, AID aggregates the similarities to produce a unified value as the intensity of the dependency. We evaluate AID on data collected from an open-source microservice benchmark and from a cloud system in production. The experimental results show that AID can efficiently and accurately predict the intensity of dependencies. We further demonstrate the usefulness of our method in a large-scale commercial cloud system.
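The four steps described in the abstract (candidate pairs, status series, similarity, aggregation) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the similarity measure or the aggregation rule, so the sketch substitutes Pearson correlation per metric and the mean of absolute correlations as plain stand-ins. The span fields (`span_id`, `parent_id`, `service`, `start`, `duration`, `error`) and the three status metrics are likewise hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import mean


def candidate_pairs(spans):
    """Step 1: derive (caller, callee) pairs from parent/child span links."""
    service_of = {s["span_id"]: s["service"] for s in spans}
    pairs = set()
    for s in spans:
        caller = service_of.get(s["parent_id"])
        if caller is not None and caller != s["service"]:
            pairs.add((caller, s["service"]))
    return pairs


def status_series(spans, service, bin_seconds, n_bins):
    """Step 2: aggregate a service's spans into a multivariate time series.

    Assumed metrics per time bin: request count, mean duration, error rate.
    """
    buckets = defaultdict(list)
    for s in spans:
        if s["service"] == service:
            buckets[int(s["start"] // bin_seconds)].append(s)
    counts, durations, error_rates = [], [], []
    for b in range(n_bins):
        items = buckets.get(b, [])
        counts.append(float(len(items)))
        durations.append(mean(float(x["duration"]) for x in items) if items else 0.0)
        error_rates.append(mean(1.0 if x["error"] else 0.0 for x in items) if items else 0.0)
    return {"count": counts, "duration": durations, "error_rate": error_rates}


def pearson(xs, ys):
    """Step 3 (stand-in similarity): Pearson correlation, 0.0 for a flat series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def intensity(caller_status, callee_status):
    """Step 4 (stand-in aggregation): mean absolute similarity across metrics."""
    sims = [abs(pearson(caller_status[k], callee_status[k])) for k in caller_status]
    return mean(sims)
```

Under these assumptions, a score near 1 means the caller's status tracks the callee's closely, and a score near 0 means the two vary independently. A real system would likely need a similarity measure that tolerates time lag between callee and caller, which plain correlation does not.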
