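Project title: Intermediate Data Management Optimization Techniques for Parallel Dataflow Programs on Cloud Platforms
Project number: No.61202065
Project type: Young Scientists Fund project
Year approved: 2013
Discipline: Computer Science
Project author: Liu Jie (刘杰)
Author affiliation: Institute of Software, Chinese Academy of Sciences
Funding: 230,000 yuan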
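Chinese abstract: Parallel dataflow programming frameworks such as MapReduce, Dryad, and Pig are widely used to process ever-growing volumes of data. Parallel dataflow programs generate massive amounts of intermediate data during execution, which consume substantial storage resources. Because intermediate data is produced in a distributed fashion, it must also be transferred among a large number of nodes. The fault-tolerance mechanism for intermediate data likewise has a serious impact on system performance. In addition, many parallel dataflow programs run concurrently on a cloud platform, which poses challenges for the task scheduling and resource management involved in intermediate data management. This project takes full account of the characteristics of intermediate data (special read/write patterns, short lifetimes, and close ties to application semantics) and, in the cloud platform setting, studies optimization techniques for the storage, transfer, and fault tolerance of the massive intermediate data of parallel dataflow programs, including: optimizing intermediate data access of parallel dataflow programs with a distributed cooperative cache; QoS-guaranteed scheduling strategies for intermediate data transfer; and application-semantics-aware fault-tolerance strategies for intermediate data. The project will implement a prototype intermediate data management system, integrate it into the Hadoop platform, and evaluate its optimization effect through experiments. The work should help improve the performance of cloud applications built on parallel dataflow programming frameworks and substantially reduce resource costs.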
Chinese keywords: big data; MapReduce; intermediate data; parallel dataflow program; performance optimization
English abstract: Parallel dataflow programming frameworks (e.g., MapReduce, Dryad, Pig and Hive) have been widely applied to big data processing in both academia and industry. From the perspective of the underlying system, a critical problem is how to manage the massive intermediate data generated during the execution of parallel dataflow programs. In practice, a large amount of memory and disk space is occupied by intermediate data. Transferring this data among distributed computing nodes commonly incurs network overhead. Fault-tolerance and fast-recovery mechanisms are also needed to keep the data highly available. Furthermore, the shared, elastic cloud platform makes intermediate data management more difficult, since multiple parallel dataflow programs run concurrently. How to store the diverse intermediate data of different programs and how to schedule its transfer under different Service Level Agreements are two major challenges. This project aims to develop a new, dedicated intermediate data management system that considers aspects such as data, resource utilization, fault tolerance and performance. Firstly, we need to understand and delve into the characteristics of intermediate data. They are mostly short-lived and accessed in the form of
English keywords: big data; MapReduce; intermediate data; parallel dataflow program; performance optimization