We are living in the era of Big Data and witnessing an explosion of data. Given the CPU and I/O limitations of a single computer, the mainstream approach to scalability is to distribute computation across a large number of processing nodes in a cluster or cloud. This paradigm gives rise to the term data-intensive computing, which denotes a data-parallel approach to processing massive volumes of data. Through the efforts of different disciplines, several promising programming models and a few platforms have been proposed for data-intensive computing, such as MapReduce, Hadoop, Apache Spark and Dryad. Even though a large body of research has been proposed to improve the overall performance of these platforms, there is still a gap between actual performance demands and the capability of current commodity systems. This paper aims to provide a comprehensive understanding of current semantics-aware approaches to improving the performance of data-intensive computing. We first introduce common characteristics and paradigm shifts in the evolution of data-intensive computing, as well as contemporary programming models and technologies. We then propose four kinds of performance defects and survey the state-of-the-art semantics-aware techniques. Finally, we discuss the research challenges and opportunities in the field of semantics-aware performance optimization for data-intensive computing.