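Project title: Research and Implementation of a Fault-Tolerant Parallel Programming Model
Project number: No.61300011
Project type: Young Scientists Fund project
Approval year: 2014
Discipline: Automation technology, computer technology
Project author: Wang Yizhuo (王一拙)
Affiliation: Beijing Institute of Technology (北京理工大学)
Funding: 230,000 yuan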
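Abstract (from Chinese): This project is motivated by two developments: the trend toward parallelism in computer architecture and the increasingly prominent need for fault tolerance. It studies a parallel programming model that supports fault tolerance. The model uses the task as the basic unit of scheduling, execution, error detection, and recovery, and exploits parallelism to improve system performance and reduce fault-tolerance overhead. The main research topics are: 1) Task-granularity error detection and recovery: a Buffer-Commit computation model is proposed for tolerating transient errors, application-level diskless checkpointing is used for tolerating permanent errors, and an algorithm is studied for retiring computing units on which errors occur frequently. 2) A hierarchical, scalable task scheduling framework: for multi-core cluster systems, a fault-tolerant work-stealing scheduling policy is used within a node, and an adaptive policy that combines work stealing and work sharing is used across nodes. 3) Task partitioning: different initial partitioning methods are studied for different parallel patterns, and runtime dynamic partitioning strategies for parallel loops and divide-and-conquer applications are studied to achieve the best load balance; in addition, a dynamic splitting algorithm for faulty tasks is studied. In summary, the project integrates fault-tolerance support into parallel programming, addressing both system performance and reliability.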
Keywords (from Chinese): Parallel programming; fault tolerance; task parallelism; work-stealing scheduling
English abstract: This research is motivated by two developments: computer architecture has entered the era of parallelism, and system reliability is an increasing concern. The proposal aims to develop a parallel programming model that supports fault tolerance. In the proposed model, the task is the basic unit of scheduling, execution, fault detection, and recovery, and task-level parallelism is exploited to achieve high performance and low fault-tolerance overhead. The proposed research focuses on: 1) Task-granularity fault detection and recovery. A buffer-commit computation model will be used for transient fault tolerance, and an application-level diskless checkpointing technique will be used for permanent fault tolerance. A discarding algorithm will be studied, which determines whether a processing element on which faults occur frequently should continue to be used. 2) A hierarchical and scalable task scheduling framework. For multi-core clusters, a fault-tolerant work-stealing scheduling scheme will be designed to exploit intra-node parallelism and support fault tolerance. An adaptive scheduling scheme that combines work-stealing and work-sharing will be used to exploit inter-node parallelism. 3) Task partitioning. The proposed model uses different initial partitioning approaches adaptively for three patterns of task parallelism
English keywords: Parallel Programming; Fault Tolerance; Task Parallelism; Work-stealing Scheduling
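Illustrative note: the abstract names a buffer-commit computation model for transient faults but gives no details. The following is only a minimal sketch of the general idea, under stated assumptions: a task writes its results to a private buffer rather than to shared state, and the buffer is committed only if a check passes. The check used here (running the task twice and comparing the two buffers) is an assumption made for illustration, not the project's stated detection mechanism, and the names `run_with_buffer_commit` and `FlakyDouble` are hypothetical.

```python
from typing import Callable, Dict, Mapping

Writes = Dict[str, int]


def run_with_buffer_commit(task: Callable[[Mapping[str, int]], Writes],
                           state: Dict[str, int],
                           max_attempts: int = 3) -> bool:
    """Run `task` so that its writes reach `state` only after a check passes.

    Assumed detection scheme: execute the task twice on the same snapshot
    and compare the buffered results. A mismatch is treated as a transient
    fault, both buffers are discarded, and the task is re-executed.
    """
    for _ in range(max_attempts):
        snapshot = dict(state)      # the task reads from a stable snapshot
        first = task(snapshot)      # writes are buffered, not applied to state
        second = task(snapshot)     # redundant execution for comparison
        if first == second:
            state.update(first)     # commit the buffered writes
            return True
    return False                    # no agreeing pair within max_attempts


class FlakyDouble:
    """Hypothetical task: doubles x, with a simulated bit flip on its first call."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, view: Mapping[str, int]) -> Writes:
        self.calls += 1
        result = view["x"] * 2
        if self.calls == 1:
            result ^= 1             # simulated transient fault
        return {"y": result}


state = {"x": 21}
task = FlakyDouble()
ok = run_with_buffer_commit(task, state)
```

In this sketch the first attempt produces two differing buffers, so nothing is committed; the second attempt agrees and its result is written to `state`. Because a faulty result never reaches shared state, recovery amounts to re-running one task rather than rolling back the whole program.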