As the architecture, systems, and data management communities pay greater attention to innovative big data systems and architectures, the pressure to benchmark and evaluate these systems rises. Considering the broad use of big data systems, big data benchmarks must include diverse data and workloads. Most state-of-the-art big data benchmarking efforts target specific types of applications or system software stacks, and hence are not suited to this purpose. This paper presents our joint research efforts on this issue with several industrial partners. Our big data benchmark suite, BigDataBench, not only covers broad application scenarios but also includes diverse and representative data sets. BigDataBench is publicly available from http://prof.ict.ac.cn/BigDataBench . We also comprehensively characterize the 19 big data workloads included in BigDataBench with varying data inputs. On a typical state-of-practice processor, the Intel Xeon E5645, we make the following observations. First, in comparison with traditional benchmarks, including PARSEC, HPCC, and SPECCPU, big data applications have very low operation intensity. Second, the volume of data input has a non-negligible impact on micro-architecture characteristics, which may pose challenges for simulation-based big data architecture research. Third, corroborating the observations in CloudSuite and DCBench (which use smaller data inputs), we find that the numbers of L1 instruction cache misses per 1000 instructions of the big data applications are higher than in the traditional benchmarks. We also find that L3 caches are effective for the big data applications, corroborating the observation in DCBench.
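The two metrics named above can be illustrated with a minimal sketch. It assumes that operation intensity is taken as executed operations per byte of memory traffic, and that cache misses per 1000 instructions (MPKI) are derived from raw hardware counter totals. The function names and the counter values are hypothetical and are not measurements from this work.

```python
def mpki(misses: int, instructions: int) -> float:
    """Cache misses per 1000 retired instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * misses / instructions


def operation_intensity(operations: int, bytes_accessed: int) -> float:
    """Operations per byte of memory traffic (assumed definition)."""
    if bytes_accessed <= 0:
        raise ValueError("bytes_accessed must be positive")
    return operations / bytes_accessed


if __name__ == "__main__":
    # Hypothetical counter totals, for illustration only.
    l1i_misses = 2_300_000
    instructions = 100_000_000
    operations = 50_000_000
    bytes_accessed = 200_000_000

    print("L1I MPKI:", mpki(l1i_misses, instructions))
    print("Operation intensity:", operation_intensity(operations, bytes_accessed))
```

A higher MPKI means more frequent instruction cache misses, while a lower operation intensity means less computation per byte moved, which is the direction of the differences reported here relative to the traditional benchmarks.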