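Project title: Research on Automatic Diagnosis Methods for Service Failures in Large-Scale Distributed Systems
Project number: No.61303053
Project type: Young Scientists Fund (青年科学基金项目)
Year of approval: 2014
Discipline: Automation technology, computer technology
Principal investigator: Li Feng (李丰)
Affiliation: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
Funding: CNY 230,000 (23万元)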
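Abstract (translated from Chinese): With the development of large-scale distributed systems, and especially the rise of cloud computing, the manifestations, causes, and propagation patterns of failures all exhibit new characteristics, which further increases the burden of identifying failures and locating their causes. This proposal studies automatic diagnosis methods for failures involving degraded quality of service in large-scale distributed systems. The research covers: (1) proposing an adaptive tracing strategy in which both the sampling rate and the tracing granularity can be adjusted in either direction, and, based on this strategy, studying techniques for the automatic extraction and continuous refinement of failure patterns to support automatic identification of service failures; this work aims to control tracing overhead and to improve the precision of failure identification and the scalability of the method; (2) studying techniques for automatically locating failure causes: first, studying the factors related to failures and a model that quantitatively evaluates each factor's contribution to a failure; then, based on the computed contribution rates, studying a method for automatically locating failure causes that alternates iteratively between derivation and divide-and-conquer verification; this work aims to locate failure causes automatically and accurately. This research will provide methods and key techniques for diagnosing service failures in deployed large-scale distributed systems, identify the manifestations and causes of service failures promptly and accurately, and improve system reliability and quality of service.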
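Keywords (translated from Chinese): large-scale distributed systems; failure diagnosis; fault localization; query optimization; static detection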
English abstract: With the development of large-scale distributed systems, and especially the rise of cloud computing, failures appear more frequently. The effort spent on failure diagnosis has also increased, since both the types and the root causes of failures have become more diverse and complex. This proposal presents a study on automatic service failure diagnosis in large-scale distributed systems. A service failure is a failure that makes a system perform poorly or run far slower than expected. There are two main research topics in the proposal: (1) automatic failure model extraction based on adaptive tracing, and (2) automatic fault localization based on derivation and verification. The goal of the first research topic is to improve both the accuracy and the scalability of failure detection while keeping the cost of tracing low. To achieve this goal, we plan to study adaptive end-to-end tracing that adjusts both the sampling rate and the granularity of online tracing. In this study, failure models will first be extracted and refined on the basis of the tracing results, and then used in turn to guide the tracing strategy. The goal of the second research topic is to improve both the accuracy and the efficiency of fault localization (i.e., locating the root causes of service failures). Our
English keywords: Large-scale distributed system; failure diagnosis; fault localization; query optimization; static detection