BigDataBench: a Big Data Benchmark Suite from Internet Services

Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, Kent Zhan, Xiaona Li, Bizhu Qiu

From arXiv, 12 pages, 6 figures. The 20th IEEE International Symposium on High Performance Computer Architecture (HPCA-2014), February 15-19, 2014, Orlando, Florida, USA

As the architecture, systems, and data management communities pay greater attention to innovative big data systems and architectures, the pressure to benchmark and evaluate these systems rises. Given the broad use of big data systems, big data benchmarks must cover a diversity of data and workloads. Most state-of-the-art big data benchmarking efforts target specific types of applications or system software stacks, and hence do not meet this requirement. This paper presents our joint research efforts on this issue with several industrial partners. Our big data benchmark suite, BigDataBench, not only covers broad application scenarios but also includes diverse and representative data sets. BigDataBench is publicly available at http://prof.ict.ac.cn/BigDataBench . We also comprehensively characterize the 19 big data workloads included in BigDataBench with varying data inputs. On a typical state-of-practice processor, the Intel Xeon E5645, we make the following observations. First, in comparison with traditional benchmarks, including PARSEC, HPCC, and SPECCPU, big data applications have very low operation intensity. Second, the volume of data input has a non-negligible impact on micro-architecture characteristics, which may pose challenges for simulation-based big data architecture research. Last but not least, corroborating the observations in CloudSuite and DCBench (which use smaller data inputs), we find that the numbers of L1 instruction cache misses per 1000 instructions of the big data applications are higher than in the traditional benchmarks; we also find that L3 caches are effective for the big data applications, corroborating the observation in DCBench.
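
The two metrics named in the abstract, L1 instruction cache misses per 1000 instructions and operation intensity, are simple ratios of hardware counter values. The Python sketch below is a minimal illustration, assuming operation intensity means operations per byte of memory traffic (the abstract does not define it). The function names and counter values are hypothetical and are not measurements from the paper.

```python
def mpki(misses: int, instructions: int) -> float:
    """Cache misses per 1000 retired instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return misses * 1000.0 / instructions


def operation_intensity(operations: int, bytes_accessed: int) -> float:
    """Operations per byte of memory traffic (assumed definition)."""
    if bytes_accessed <= 0:
        raise ValueError("bytes_accessed must be positive")
    return operations / bytes_accessed


# Made-up counter values for illustration only (not data from the paper).
sample = {
    "instructions": 2_000_000,
    "l1i_misses": 46_000,
    "operations": 500_000,
    "bytes_accessed": 4_000_000,
}

print(mpki(sample["l1i_misses"], sample["instructions"]))                  # 23.0
print(operation_intensity(sample["operations"], sample["bytes_accessed"]))  # 0.125
```

Normalizing misses by instruction count makes workloads of different lengths comparable, which may be why the abstract reports misses per 1000 instructions rather than raw miss counts.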

Related content

Big data

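Big data technology is the ability to quickly extract valuable information from many different types of data. Understanding this is essential, and it is precisely what gives the technology the potential to reach many enterprises. Big data has four "V"s, or characteristics at four levels. First, the volume of data is huge, jumping from the TB level to the PB level. Second, the data types are varied, such as web logs, video, images, and geolocation information. Third, value density is low: in video, for example, only one or two seconds of a continuous, uninterrupted surveillance recording may be useful. Fourth, processing speed is fast.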