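Research on a Fault-Tolerant Computing Model for Exascale High-Performance Computing Systems

Project title: Research on a Fault-Tolerant Computing Model for Exascale High-Performance Computing Systems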

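Project number: No. 61272142

Project type: General Program

Year of approval: 2013

Discipline: Automation technology, computer technology

Project author: Lu Kai (卢凯)

Author affiliation: National University of Defense Technology, Chinese People's Liberation Army (中国人民解放军国防科学技术大学)

Funding: 720,000 yuan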

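Abstract (translated from Chinese): Existing parallel computing models have no built-in fault tolerance and must rely on external fault-tolerance techniques such as checkpointing to sustain computation. This incurs high performance overhead and low effective system utilization, and cannot meet the operational needs of future exascale high-performance computing systems. Building on emerging non-volatile memory technology (NVRAM) and targeting the fault-tolerance requirements of future exascale HPC, this project studies a new parallel computing model with built-in fault-tolerant computing capability. The model departs from the design philosophy of traditional parallel computing models, in which the operating system provides the application's runtime environment, and instead adopts an operating mode that separates system services from the runtime environment. By studying an NVRAM-based classified storage model and its management strategies, and by designing a new non-volatile fault-tolerant process model that is context self-contained and supports in-place recovery, the new fault-tolerant parallel computing model can keep the complete running state of a user application resident in NVRAM in real time. In addition, by studying a new execution mode that supports parallel steady-state execution, it supports fast in-place recovery and continued execution of user applications. The fault-tolerant parallel computing model can effectively overcome the high performance overhead and low system utilization of traditional checkpoint-based fault tolerance, meeting the fault-tolerance requirements of future exascale HPC.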

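Keywords (translated from Chinese): non-volatile memory devices; memory management; steady-state execution; fault tolerance; process model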

English abstract: In the design of exascale high-performance computing (HPC) systems, whole-system reliability is a serious problem, because researchers predict that the MTBF of exascale HPC systems will be less than half an hour. Currently, providing a highly available computing environment presents a great challenge. Because current parallel computing models lack fault-tolerance capability, we have to rely on external fault-tolerance techniques, such as checkpoint/restart, to improve the reliability of HPC systems. Checkpoint/restart techniques periodically record the running state of a parallel application and resume execution from the checkpoint file after the HPC system fails. However, as the MTBF of HPC systems decreases, the large overhead of checkpoint/restart will make the utilization of exascale HPC systems very low. Thus, checkpoint/restart techniques cannot meet the requirements of exascale HPC systems. In this project, we propose a new fault-tolerant parallel computing model for exascale HPC systems. The new model takes advantage of emerging non-volatile memory technology (NVRAM) to provide persistent memory storage support. The new fault-tolerant computing model: 1) decouples the process from the operating system and redesigns the OS only to be servic

English keywords: non-volatile memory; memory management; fault tolerance; process model
