The paradigm of big data is characterized by the need to collect and process data sets of great volume, arriving in the system at high velocity and in a variety of formats. Spark is a widely used big data processing system that can be integrated with Hadoop to provide powerful abstractions to developers, such as distributed storage through HDFS and resource management through YARN. When properly configured, Spark can also provide quality attributes such as scalability, fault tolerance, and security. However, all of these benefits come at the cost of complexity, high memory requirements, and added processing latency. An alternative approach is to use a lean software stack, like Unicage, that delegates most control back to the developer. In this work we evaluated the performance of big data processing with Spark versus Unicage, in a cluster environment hosted in the IBM Cloud. Two sets of experiments were performed: batch processing of unstructured data sets, and query processing of structured data sets. The input data sets were of significant size, ranging from 64 GB to 8192 GB in volume. The results show that the performance of Unicage scripts is superior to Spark for search workloads like grep and select, but that the abstractions of distributed storage and resource management from the Hadoop stack enable Spark to execute workloads with inter-record dependencies, such as sort and join, with correct outputs.
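The distinction between the two workload classes can be illustrated with a minimal sketch. A search workload such as grep is record-local: each line can be matched independently, so a lean shell pipeline in the Unicage style needs no coordination across records. The file name, pattern, and data below are illustrative placeholders, not the paper's actual data sets.

```shell
# Record-local search workload: each line is matched on its own,
# so the work partitions trivially across files or machines.
printf 'error: disk full\ninfo: ok\nerror: timeout\n' > /tmp/records.txt

# Count matching records (grep-style batch search).
grep -c '^error' /tmp/records.txt
```

By contrast, a workload such as sort or join has inter-record dependencies: producing a correct global ordering requires every partition's data to be seen together, which is where distributed storage (HDFS) and resource management (YARN) give Spark its advantage over independent per-node scripts.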