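Optimization Techniques for Intermediate Data Management of Parallel Dataflow Programs on Cloud Platforms

Project title: Optimization Techniques for Intermediate Data Management of Parallel Dataflow Programs on Cloud Platforms

Project number: No. 61202065

Project type: Young Scientists Fund project

Year of approval: 2013

Discipline: Computer Science

Project author: Liu Jie (刘杰)

Affiliation: Institute of Software, Chinese Academy of Sciences

Funding: CNY 230,000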

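Chinese abstract (translated): Parallel dataflow programming frameworks such as MapReduce, Dryad, and Pig are widely used to process ever-growing volumes of data. When parallel dataflow programs execute, they produce massive amounts of intermediate data that occupy large amounts of storage. This intermediate data is generated in a distributed fashion and must be transferred among many nodes. The fault-tolerance mechanism for intermediate data also significantly affects system performance. Moreover, many parallel dataflow programs run concurrently on a cloud platform, which poses challenges for the task scheduling and resource management involved in intermediate data management. This project takes full account of the characteristics of intermediate data (special read/write patterns, short lifecycles, and close ties to application semantics) and studies, in a cloud platform setting, optimization techniques for the storage, transfer, and fault tolerance of the massive intermediate data of parallel dataflow programs, including: optimizing intermediate data access through distributed cooperative caching; QoS-guaranteed scheduling policies for intermediate data transfer; and application-semantics-aware fault-tolerance policies for intermediate data. The project will implement a prototype intermediate data management system, integrate it into the Hadoop platform, and evaluate its optimization effect experimentally. The work should help improve the performance of cloud applications built on parallel dataflow programming frameworks and substantially reduce resource costs.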

Chinese keywords (translated): big data; MapReduce; intermediate data; parallel dataflow programs; performance optimization

English abstract: Parallel dataflow programming frameworks (e.g., MapReduce, Dryad, Pig, and Hive) have been widely applied to big data processing in both academia and industry. From the perspective of the underlying system, a critical problem is how to manage the massive intermediate data generated during the execution of parallel dataflow programs. In practice, a large amount of memory and disk space is occupied by intermediate data. Transferring this data among distributed computing nodes also incurs substantial network overhead. Fault-tolerance and fast-recovery mechanisms are needed as well to keep the data highly available. Furthermore, a shared, elastic cloud platform makes intermediate data management harder, since multiple parallel dataflow programs run concurrently. How to store the diverse intermediate data of different programs, and how to schedule its transfer under different Service Level Agreements, are two major challenges. This project aims to develop a new, dedicated intermediate data management system that considers aspects such as data, resource utilization, fault tolerance, and performance. First, we need to understand the characteristics of intermediate data in depth. They are mostly short-lived and accessed in the form of

English keywords: big data; MapReduce; intermediate data; parallel dataflow program; performance optimization

