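Project title: Research on Key Technologies for the Reliability Wall Problem in Exascale Computing
Project number: No. 61303068
Project type: Young Scientists Fund
Approval year: 2014
Discipline: Automation technology; computer technology
Project author: Wang Zhiyuan (王之元)
Author affiliation: National University of Defense Technology, Chinese People's Liberation Army (中国人民解放军国防科学技术大学)
Funding: CNY 230,000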
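Chinese abstract (translated): High-performance computing systems today all rely on parallel processing to raise system performance substantially. As system scale grows, and especially once it reaches exascale (10^18 operations per second), the reliability wall becomes a major challenge. To alleviate or eliminate the reliability wall problem, this project builds on the research group's results and accumulated expertise in computer architecture and fault tolerance, and targets the high-efficiency requirements of future exascale computing. It will study models and theory of the reliability wall bottleneck, together with lightweight checkpoint/recovery, scalable fault tolerance based on hardware redundancy, and fault tolerance based on application characteristics. These models, theory, and techniques will be validated on hardware and software verification platforms, with the aim of achieving scalable system fault tolerance and ensuring the efficient, sustained operation of future exascale systems. The project plans to publish more than 8 high-level academic papers, send more than 2 participants to international conferences, and train 4-6 graduate students.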
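Chinese keywords (translated): reliability wall; fault tolerance; error propagation; failure prediction; self-fault tolerance of memristor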
English abstract: High-performance computing systems currently all rely on parallel processing to increase system performance. As system size grows, especially once the computation scale reaches exascale, the reliability wall becomes a major challenge for these systems. To alleviate or remove the reliability wall problem, this project builds on our team's prior research and accumulated expertise in computer architecture and fault tolerance, and targets the high-efficiency needs of future exascale computing. It aims to study the model and theory of the reliability wall, lightweight checkpoint/recovery, a scalable fault tolerance mechanism based on hardware redundancy, and a fault tolerance mechanism based on application features. In addition, the project will verify these models, theory, and techniques on a hardware/software verification platform, in order to achieve scalable fault tolerance and guarantee the efficient, sustained operation of exascale computing systems. The project expects to publish more than 8 high-level academic articles, attend international conferences more than 2 times, and train 4-6 graduate students.
English keywords: Reliability Wall; Fault Tolerance; Error Propagation; Failure Prediction; Self-Fault Tolerance of Memristor