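Project title: Research on Checkpointing Methods for General-Purpose GPU Computing Systems
Project number: No.61272190
Project type: General Program
Year of approval: 2013
Discipline: Automation technology; computer technology
Project author: 陈浩 (Chen Hao)
Affiliation: 湖南大学 (Hunan University)
Funding: 810,000 CNY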
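Abstract (translated from Chinese): Set against the use of graphics processing units (GPUs) in general-purpose computing systems, this project takes improving the reliability of general-purpose GPU computing systems as its starting point. It explores the implementation methods and theoretical foundations of efficient checkpointing for GPU programs, so that the technique meets the theoretical and practical needs of GPU computing in high-performance computing and supercomputing. Guided by the design principles of robustness, high performance, transparency, and flexibility, the project combines traditional CPU checkpointing techniques with the architectural characteristics of GPUs and systematically studies the main processes and key problems of in-kernel GPU checkpointing, integrating techniques such as incremental storage and static code analysis into the reading, saving, and restoring of in-kernel GPU state. It analyzes and models GPU hardware state, extracts the main characteristic parameters, and, building on existing general-purpose GPU software development frameworks, analyzes the internal semantics of GPU programs to construct a user-transparent checkpointing technique. It also studies applications of GPU checkpointing in different scenarios, such as live migration of GPU computing tasks in virtual machine environments, debugging support for GPU programs, and automatic error diagnosis.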
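Keywords (translated from Chinese): general-purpose GPU computing;checkpointing;high-performance computing;graph computing;virtualization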
English abstract: This project aims to improve the dependability of general-purpose GPU (GPGPU) computing systems used in high-performance computing and supercomputing by exploring the theoretical foundations and implementation techniques of highly efficient checkpointing mechanisms for such systems. The proposal makes three major contributions. First, guided by four design principles (robustness, high performance, transparency, and flexibility), we propose a novel checkpoint-inside-the-kernel mechanism for GPU kernels. It combines techniques such as incremental storage and static analysis to aid the retrieval, recording, and recovery of GPU runtime state, and it can be easily integrated into a conventional CPU-based checkpointing system. Second, building on existing GPGPU development frameworks, we propose to model GPU hardware state by leveraging the inherent semantics of GPU programs, which is an important step toward a transparent GPU checkpointing system. Third, we explore potential applications of GPU checkpointing in three typical scenarios: online task migration in virtual machines, debugging support, and automatic failure diagnosis for GPU programs.
English keywords: GPGPU;checkpointing;high-performance computing;graph computing;virtualization