In-Memory Indexed Caching for Distributed Data Processing

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited to modern cloud-based data-science workloads because of its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction with built-in indexing for fast lookup and join operations. It also supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library that can be integrated into existing Spark programs with minimal effort. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks, using both Apache Spark and Databricks Runtime. Our evaluation shows that the Indexed DataFrame significantly speeds up query execution compared to a non-indexed dataframe, while incurring modest memory overhead.
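The combination of indexing, append support, and versioned reads described in the abstract can be illustrated with a small single-process sketch. This is not the paper's implementation and does not use the Indexed DataFrame API; the class `ToyIndexedTable`, its methods, and the choice of the row count as a version number are all hypothetical, chosen only to show how a hash index lets lookups and joins avoid full scans, and how an append-only layout lets a reader stay pinned to an older snapshot.

```python
from collections import defaultdict


class ToyIndexedTable:
    """Hypothetical toy table: append-only rows plus a hash index on one column."""

    def __init__(self, key_column):
        self.key_column = key_column
        self.rows = []                    # append-only row storage
        self.index = defaultdict(list)    # key value -> positions in self.rows

    def append(self, new_rows):
        """Append rows and return the new version (row count after the append)."""
        for row in new_rows:
            self.index[row[self.key_column]].append(len(self.rows))
            self.rows.append(row)
        return len(self.rows)

    def lookup(self, key, version=None):
        """Return rows matching key that were visible at the given version."""
        limit = len(self.rows) if version is None else version
        return [self.rows[i] for i in self.index.get(key, []) if i < limit]

    def join(self, probe_rows, probe_column, version=None):
        """Index-based join: probe the index once per row instead of scanning."""
        return [
            {**match, **probe}
            for probe in probe_rows
            for match in self.lookup(probe[probe_column], version)
        ]


table = ToyIndexedTable("user_id")
v1 = table.append([{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "bob"}])
v2 = table.append([{"user_id": 1, "name": "ada-updated"}])

old_view = table.lookup(1, version=v1)   # reader pinned to the first snapshot
new_view = table.lookup(1)               # reader at the latest version
joined = table.join([{"user_id": 2, "amount": 30}], "user_id")
```

In this sketch a reader that holds `v1` keeps seeing the table as it was before the second append, which is the basic idea behind multi-version reads; a real distributed system would additionally need partitioning, concurrency safety, and memory management, none of which are modeled here.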


Related content

Spark

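Apache Spark is a fast, general-purpose compute engine designed for large-scale data processing. It is a general-purpose parallel framework in the style of Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab. Spark retains the advantages of Hadoop MapReduce, but unlike MapReduce it can keep a job's intermediate output in memory instead of reading and writing HDFS, which makes it better suited to iterative algorithms such as those used in data mining and machine learning.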
