The paradigm of big data is characterized by the need to collect and process data sets of great volume, arriving in the system at high velocity and in a variety of formats. Spark is a widely used big data processing system that can be integrated with Hadoop to provide powerful abstractions to developers, such as distributed storage through HDFS and resource management through YARN. When properly configured, Spark can also provide quality attributes such as scalability, fault tolerance, and security. However, all of these benefits come at the cost of complexity, high memory requirements, and added processing latency. An alternative approach is to use a lean software stack, like Unicage, that delegates most control back to the developer. In this work we evaluated the performance of big data processing with Spark versus Unicage, in a cluster environment hosted in the IBM Cloud. Two sets of experiments were performed: batch processing of unstructured data sets, and query processing of structured data sets. The input data sets were of significant size, ranging from 64 GB to 8192 GB in volume. The results show that the performance of Unicage scripts is superior to Spark for search workloads like grep and select, but that the abstractions of distributed storage and resource management from the Hadoop stack enable Spark to execute workloads with inter-record dependencies, such as sort and join, with correct outputs.
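The distinction between the two workload classes can be illustrated with a minimal sketch. A search workload such as grep is record-local: each line can be matched independently, so a lean shell pipeline in the Unicage style needs no coordination across records. The file name, pattern, and data below are illustrative placeholders, not the paper's actual data sets.

```shell
# Record-local search workload: each line is matched on its own,
# so the work partitions trivially across files or machines.
printf 'error: disk full\ninfo: ok\nerror: timeout\n' > /tmp/records.txt

# Count matching records (grep-style batch search).
grep -c '^error' /tmp/records.txt
```

By contrast, a workload such as sort or join has inter-record dependencies: producing a correct global ordering requires every partition's data to be seen together, which is where distributed storage (HDFS) and resource management (YARN) give Spark its advantage over independent per-node scripts.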