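Project title: Low-Overhead Rollback-Recovery Fault-Tolerance Techniques for Large-Scale High-Performance Computing
Project number: No.61272401
Project type: General Program
Approval year: 2013
Discipline: Automation technology, computer technology
Project author: 杨金民 (Yang Jinmin)
Affiliation: 湖南大学 (Hunan University)
Funding: 780,000 CNY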
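Chinese abstract: High-performance computing systems raise performance by scaling up the number of compute nodes, which brings a reliability problem: faults grow exponentially with the number of nodes, so matching fault-tolerance support is required. Rollback-recovery fault tolerance relies on time redundancy rather than node redundancy, which suits the needs of high-performance computing. However, existing methods save state data solely as a memory image when taking a process checkpoint, and replay logged messages serially during fault recovery, so their overhead rises sharply as the system scales. This project studies the asymmetry between process checkpointing and process regeneration, and proposes a process checkpointing technique based on state distinction: it models program semantics to analyze the composition of process state, and substitutes the characteristic values of objects for their memory images, thereby reducing checkpoint data volume and checkpoint overhead. It also studies the non-equivalence between process roll-forward and normal process execution, and proposes a fast process roll-forward technique based on concurrent replay: it estimates message scopes to determine the independence of messages, and uses result logs to remove dependencies between messages, thereby increasing the concurrency of message replay and reducing fault-recovery overhead. A fault-tolerance support library based on these techniques will be implemented to solve the problem of overhead rising sharply with system scale, providing low-overhead fault-tolerance support for large-scale high-performance computing.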
Chinese keywords: high-performance computing; rollback recovery; time overhead; state distinction; concurrent replay
English abstract: More and more computing nodes are being integrated into high-performance computers to improve their performance, with the result that faults increase exponentially with the number of nodes. In this situation, fault tolerance is necessary for system dependability. Unfortunately, fault tolerance based on node redundancy often adds to system complexity, provoking more faults. Rollback recovery is a trusted and popular approach to fault tolerance in high-performance computing, because it relies on time redundancy and needs no node redundancy. However, the time overhead of existing rollback-recovery schemes increases sharply with the number of nodes, because they save process state at a checkpoint solely as a memory image and replay logged messages sequentially during fault recovery. This project exploits the non-equivalence between process checkpointing and process regeneration in terms of time, and proposes a process checkpointing technique based on state distinction. The technique identifies the object components of a process through semantic modeling of program and data, classifies them as environment state or application state, and then resolves the characteristic values of the environment state to stand in for it. The technique should decrease the checkpoint size, leading to
English keywords: high performance computing; rollback recovery; time overhead; state distinguishing; concurrent replaying
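
The concurrent-replay idea in the abstract (judge message independence from estimated message scopes, then replay independent messages in parallel) can be illustrated with a minimal sketch. It is not the project's implementation. The `LoggedMessage` type, its `scope` field, and `plan_replay_waves` are hypothetical names, and the sketch assumes that two messages are independent exactly when their estimated scopes share no state object. The result-log mechanism for removing dependencies is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedMessage:
    seq: int          # position of the message in the original log
    scope: frozenset  # hypothetical estimate of the state objects the message touches


def plan_replay_waves(log):
    """Group logged messages into waves that could be replayed concurrently.

    Assumption: messages conflict when their estimated scopes overlap.
    Messages within one wave have pairwise disjoint scopes, and every
    message lands in a later wave than each earlier message it conflicts with.
    """
    waves = []     # waves[i] holds the messages replayed together in step i
    level_of = {}  # seq -> index of the wave the message was assigned to
    for i, msg in enumerate(log):
        level = 0
        for earlier in log[:i]:
            if msg.scope & earlier.scope:  # overlapping scope: keep log order
                level = max(level, level_of[earlier.seq] + 1)
        level_of[msg.seq] = level
        if level == len(waves):
            waves.append([])
        waves[level].append(msg)
    return waves


example_log = [
    LoggedMessage(0, frozenset({"a"})),
    LoggedMessage(1, frozenset({"b"})),
    LoggedMessage(2, frozenset({"a", "b"})),
    LoggedMessage(3, frozenset({"c"})),
]
example_waves = [[m.seq for m in wave] for wave in plan_replay_waves(example_log)]
```

In this example, messages 0, 1, and 3 touch disjoint state and fall into the first wave, while message 2 overlaps with both 0 and 1 and must wait for the second wave. A serial replay would need four steps; under the stated assumption this plan needs two.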