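Project title: Research on Active Fault Tolerance Techniques for Supercomputers Based on Online Machine Learning
Project number: No.61272141
Project type: General Program
Approval year: 2013
Discipline: Automation technology; computer technology
Principal investigator: Jiang Yanhuang (蒋艳凰)
Affiliation: National University of Defense Technology, Chinese People's Liberation Army (中国人民解放军国防科学技术大学)
Funding: 810,000 yuan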
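Chinese abstract (translated): Supercomputers are moving from petascale to exascale computing. Experts expect the mean time between failures (MTBF) of an exascale system to be only a few tens of minutes, and traditional passive fault tolerance, with its high overhead, will not be able to meet the availability requirements of such systems. Active fault tolerance uses failure prediction to handle likely failures before they occur, and is an important way to improve system availability. To address the reliability problems facing future supercomputers, this project proposes a fault tolerance strategy that combines active and passive techniques, with failure prediction as its key component. The state of each node is acquired in real time and mined online to learn the patterns by which various failures occur. The learned results are then used to predict system failures, and low-overhead active fault tolerance is applied to failures that are about to occur, thereby improving supercomputer availability. The main research topics include: the online failure learning and prediction model, acquisition and preprocessing of system state data, online failure learning methods, real-time failure prediction strategies, failure rule extraction techniques, and active fault tolerance methods. The goal of the project is to improve the online failure prediction capability of supercomputers, reduce the overhead of fault tolerance, and ensure that large-scale parallel applications run efficiently and continuously.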
Chinese keywords (translated): online learning; active fault tolerance; high performance computing; failure prediction;
English abstract: Supercomputers are advancing from petascale to exascale computing, and the MTBF of future exascale systems is expected to drop to only several tens of minutes. Because of their heavy overhead, traditional passive fault tolerance techniques will no longer satisfy the availability requirements of future supercomputers. By applying failure prediction, active fault tolerance can deal with system faults before they happen, which makes it an important way to improve the availability of future supercomputers. This project combines active and passive fault tolerance techniques, with online failure prediction as the key part of the strategy. The state of each computing node is acquired in real time, and the rules governing system faults are analyzed and learned from the state data. The learned results are then applied to predict faults in the supercomputer. For predicted faults, active fault tolerance methods are activated before the faults actually happen. This proposal covers research on the learning and prediction model, state acquisition, online learning algorithms, failure prediction strategy, rule extraction for system faults, active fault tolerance methods, etc. The project aims at improving the prediction accuracy for system faults and reducing the overhead of fault tolerance, so the efficiency and
English keywords: on-line learning;active fault tolerance;high performance computing;failure prediction;