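Project title: Research on Proactive Fault Detection Techniques for Large-Scale Computing Systems
Project number: No.60803045
Project type: Young Scientists Fund
Approval year: 2009
Project discipline: Metallurgy and Metal Technology
Project author: Wu Linping (武林平)
Author affiliation: Institute of Applied Physics and Computational Mathematics, Beijing (北京应用物理与计算数学研究所)
Project funding: 210,000 CNY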
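Chinese abstract: Addressing the reliability of large-scale computing systems, this project takes a proactive fault management approach and studies three problems: 1) the current state of reliability and the failure characteristics of large-scale computing systems; 2) a state monitoring mechanism for large-scale computing systems oriented toward proactive fault management; 3) methods for proactive fault detection and isolation based on runtime state information. For the first problem, we analyzed actual operational data from several large-scale computing systems in China and abroad, summarized the failure characteristics of capability computing systems, and studied three aspects: failure causes, fault propagation mechanisms, and fault management strategies. These results can serve as a reference for system development and operations management. For the second problem, starting from the requirements of proactive fault management, we proposed and implemented a "multi-mode" monitoring system for large-scale computing systems. The monitoring system achieved good results after deployment on a production computing platform at our institute. For the third problem, building on the multi-mode monitoring system, runtime state data are taken as input and fault feature signals are obtained through cluster analysis; by analyzing fault propagation paths, we propose an isolation-based mechanism for handling faults automatically online. Seven academic papers have been published on this work, and the team took part in one international conference and two domestic conferences.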
Chinese keywords: supercomputer; proactive fault management; multi-mode monitoring; fault tolerance; fault characteristics
English abstract: To improve the availability and dependability of large-scale computing systems, this project studies proactive fault management mechanisms, focusing on three main problems: 1) the failure characteristics of, and fault tolerance schemes for, large-scale computing systems; 2) a runtime monitoring method for proactive fault management; 3) a proactive fault detection and faulty-hardware isolation method based on runtime monitoring data. For the first problem, we summarize the main fault models and features from the public fault data of several supercomputers. The research results cover the root causes of faults, fault propagation paths, and fault management strategies, and can be used in system development and runtime management. For the second problem, this project introduces a multi-view monitoring strategy. The multi-view monitoring system has been deployed on one of our production supercomputers, where it improved reliability and decreased the failure rate of parallel jobs. For the third problem, the normal activity of nodes in an HPC cluster is modeled from runtime state data using clustering analysis; based on an analysis of fault propagation paths, we develop an online fault isolation method. We have published 7 papers on these research results and taken part in one international conference and two domestic conferences.
English keywords: Supercomputer; Proactive fault management; Multi-view monitoring; Fault tolerance; Failure characteristics