Distributed data processing ecosystems are widespread, and their components are highly specialized, which makes efficient interoperability between them an urgent need. The community has recently adopted Apache Arrow as a format mediator that provides an efficient in-memory data representation. Arrow enables efficient data movement between data processing and storage engines, significantly improving interoperability and overall performance. In this work, we design a new zero-cost data interoperability layer between Apache Spark and Arrow-based data sources through the Arrow Dataset API. Our novel data interface helps separate the computation (Spark) and data (Arrow) layers. This enables practitioners to seamlessly use Spark to access data from all Arrow Dataset API-enabled data sources and frameworks. To benefit the community, we open-source our work and show that consuming data through Apache Arrow is zero-cost: our data interface performs on par with or better than native Spark.