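Project title: Research on Key Technologies for Optimizing Parallel Programs on High-Performance Cloud Platforms
Project number: No.61472201
Project type: General Program
Year of approval: 2015
Discipline: Automation Technology; Computer Technology
Project author: Jidong Zhai (翟季冬)
Affiliation: Tsinghua University
Funding: 840,000 CNY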
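Chinese abstract: With the development of cloud computing and the emergence of high-performance cloud platforms optimized for parallel computing, more and more users are running scientific computing programs on such platforms. However, complex cloud pricing models, flexible cloud resource configuration modes, non-customized communication networks, and significant system noise pose new challenges for running large-scale parallel programs on high-performance cloud platforms. To address these problems, the project covers the following work. First, it proposes a semi-elastic virtual cluster computing model for high-performance cloud platforms. By aggregating the job requests of many users, the model schedules and manages cloud resources in a unified way and adjusts the size of the virtual cluster dynamically according to job scale, which lowers user cost and improves job execution efficiency. Second, it proposes a learning-to-rank method that automatically predicts the optimal cloud configuration for a given parallel program. To cope with the combinatorial explosion of cloud resource configurations, it proposes a statistical method based on Plackett-Burman (PB) matrices to reduce the dimensionality of the high-dimensional parameter space. Finally, targeting the characteristics of high-performance cloud platforms, it proposes using static analysis to hide communication in parallel programs automatically, and using performance assertions to detect system noise on the cloud platform online, improving the performance and scalability of parallel programs.
Chinese keywords: High-performance computing; Cloud computing; Parallel programs; Performance optimization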
English abstract: With the development of cloud computing and the emergence of high-performance computing (HPC) clouds, more and more users are running parallel applications on such platforms. However, complex cloud pricing models, flexible resource allocation models, non-customized communication networks, and significant system noise pose new challenges for executing large-scale parallel applications on HPC cloud platforms. To address these problems, this project focuses on the following points. First, we propose a semi-elastic virtual cluster computing model for HPC clouds. By aggregating the demands of multiple users, the system provisions cloud resources using different types of reserved instances to optimize overall cost-effectiveness. It also adjusts the virtual cluster capacity and plans its resource distribution across different cloud pricing classes. Second, we propose using learning-to-rank to perform black-box performance and cost prediction. To tackle the high-dimensional parameter space unique to cloud platforms, we enable affordable, reusable, and incremental training guided by Plackett-Burman matrices. Finally, based on the characteristics of HPC clouds, we propose using static analysis to automatically overlap communication and computation in parallel applications, and using performance assertions to detect system noise on HPC clouds.
English keywords: High Performance Computing; Cloud Computing; Parallel Programs; Performance Optimization
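
The abstract mentions Plackett-Burman matrices as a way to shrink the cloud configuration space, but gives no detail. As a rough illustration only, the sketch below builds the standard 8-run, 7-factor Plackett-Burman design and ranks factors by main effect. It is a generic textbook screening procedure, not the project's implementation; the function names and the synthetic responses in the usage note are assumptions made for this example.

```python
# Generic Plackett-Burman screening sketch (illustrative, not the project's code).

# Standard first row of the 8-run design; the other rows are cyclic shifts.
GENERATOR_8 = [1, 1, 1, -1, 1, -1, -1]


def pb_design_8():
    """Return the 8-run, 7-factor Plackett-Burman design as a list of rows."""
    n = len(GENERATOR_8)
    rows = [[GENERATOR_8[(j - i) % n] for j in range(n)] for i in range(n)]
    rows.append([-1] * n)  # final run sets every factor to its low level
    return rows


def main_effects(design, responses):
    """Main effect per factor: mean response at +1 minus mean response at -1."""
    half = len(design) / 2  # each level appears in exactly half of the runs
    return [
        sum(row[j] * y for row, y in zip(design, responses)) / half
        for j in range(len(design[0]))
    ]


def rank_factors(effects):
    """Return factor indices ordered by absolute main effect, largest first."""
    return sorted(range(len(effects)), key=lambda j: -abs(effects[j]))
```

For example, with hypothetical measurements `y = [10 + 3 * r[0] - r[3] for r in pb_design_8()]`, `main_effects` attributes the whole variation to factors 0 and 3, so only those two would need to be explored further; eight runs stand in for the 128 runs of a full two-level sweep over seven factors.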