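Project title: Research on Efficient Fault-Tolerance Methods for CFD Parallel Application Development Frameworks
Project number: No.61303071
Project type: Young Scientists Fund
Year of approval: 2014
Discipline: Automation Technology, Computer Technology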
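Project author: Xu Xinhai (徐新海)
Author affiliation: National University of Defense Technology, Chinese People's Liberation Army (中国人民解放军国防科学技术大学)
Funding: 250,000 yuan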
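Abstract (translated from Chinese): Simulating CFD applications on parallel computers is widely accepted in academia and industry. However, the increasingly serious reliability problems of parallel computing systems severely constrain the further development of CFD methods. Traditional fault-tolerance methods, when applied to CFD applications, face a conflict between usability and efficiency: on the one hand, to be easy to use, system-level fault-tolerance methods introduce a large fault-tolerance overhead, which is unacceptable for large-scale parallel CFD applications; on the other hand, to reduce that overhead, application-level fault tolerance places higher demands on programmers, which programmers in the CFD application domain can hardly meet. This project is the first to propose embedding fault-tolerance methods into a CFD-oriented parallel application development framework in order to design efficient fault-tolerance methods. Relying on the framework's highly abstract organization, CFD application developers can configure various fault-tolerance methods in a natural-language-like manner; at the same time, the program information provided by the framework guides the design of efficient fault-tolerance methods. We will study the organizational structure, the mechanisms and methods, and the optimization techniques of fault tolerance for CFD-oriented parallel application development frameworks, and ultimately design and implement a practical CFD parallel application development framework with embedded fault-tolerance functionality, thereby solving or alleviating the reliability problem of parallel CFD simulation.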
Keywords (translated from Chinese): fault tolerance; computational fluid dynamics; framework; checkpoint; soft error
English abstract: Simulating CFD applications on parallel computers is widely accepted in academia and industry. However, the steadily worsening reliability of parallel computers has seriously constrained the further development of CFD methods. Applying traditional fault-tolerance methods to CFD applications exposes a conflict between usability and efficiency: on the one hand, system-level fault-tolerance methods are easy to adopt but introduce a large overhead, which is unacceptable in large-scale parallel CFD applications; on the other hand, application-level fault-tolerance methods reduce the overhead but place higher demands on programmers, and those demands are beyond the capacity of programmers in the CFD application field. To resolve this conflict, this project is the first to propose embedding fault-tolerance methods into the CFD parallel application development framework in order to develop highly efficient fault-tolerance methods. Our method takes advantage of the highly abstract organization of the framework and enables CFD application researchers to configure various fault-tolerance methods in a manner similar to natural language. Meanwhile, our method uses the program information provided by the framework to guide the design of high-efficient f
English keywords: Fault Tolerance; CFD; Framework; Checkpointing; Soft Error