Apache Spark 零价、绿-无数据接口 (Zero-Cost, Arrow-Enabled Data Interface for Apache Spark)

Distributed data processing ecosystems are widespread and their components are highly specialized, such that efficient interoperability is urgent. Recently, Apache Arrow was chosen by the community to serve as a format mediator, providing efficient in-memory data representation. Arrow enables efficient data movement between data processing and storage engines, significantly improving interoperability and overall performance. In this work, we design a new zero-cost data interoperability layer between Apache Spark and Arrow-based data sources through the Arrow Dataset API. Our novel data interface helps separate the computation (Spark) and data (Arrow) layers. This enables practitioners to seamlessly use Spark to access data from all Arrow Dataset API-enabled data sources and frameworks. To benefit our community, we open-source our work and show that consuming data through Apache Arrow is zero-cost: our novel data interface is either on-par or more performant than native Spark.

翻译：分布式数据处理生态系统十分广泛,其组成部分高度专业化,因此迫切需要有效的互操作性。最近,社区选择阿帕奇箭头作为格式调解员,提供高效的模拟数据代表。箭头可以使数据处理引擎和储存引擎之间高效的数据流动,大大改进互操作性和总体性能。在这项工作中,我们通过箭头数据集API设计一个新的零成本互操作性数据层。我们的新数据界面有助于将计算(鼠标)和数据(箭头)层分离开来。这使实践者能够无缝地利用闪光来获取来自所有箭头数据集API驱动的数据源和框架的数据。为了造福我们社区,我们打开了我们的工作来源,并表明通过阿帕奇箭头消费数据是零成本的:我们的新数据界面要么是平行的,要么是比本地的Sparg。

相关内容

Spark

关注 51

Apache Spark 是专为大规模数据处理而设计的快速通用的计算引擎。Spark是UC Berkeley AMP lab (加州大学伯克利分校的AMP实验室)所开源的类Hadoop MapReduce的通用并行框架，Spark，拥有Hadoop MapReduce所具有的优点；但不同于MapReduce的是Job中间输出结果可以保存在内存中，从而不再需要读写HDFS，因此Spark能更好地适用于数据挖掘与机器学习等需要迭代的MapReduce的算法。

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

163+阅读 · 2019年10月12日

2019年机器学习框架回顾

专知会员服务

36+阅读 · 2019年10月11日

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日