Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited to modern cloud-based data-science workloads because of its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction with built-in indexing for fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library that can be integrated into existing Spark programs with minimal effort. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks, using both Apache Spark and Databricks Runtime. Our evaluation shows that the Indexed DataFrame significantly speeds up query execution compared to a non-indexed dataframe, while incurring a modest memory overhead.
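To make the combination of indexing and versioned appends concrete, the following is a minimal single-process sketch in Python. It is not the Indexed DataFrame implementation or its API: the class `IndexedTable`, its methods, and the use of the row count as a version number are all hypothetical choices made for illustration, and the sketch ignores distribution, partitioning, and Spark integration entirely.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Optional, Tuple

Row = Tuple[Any, ...]


@dataclass
class IndexedTable:
    """Toy append-only table with a hash index on one column (hypothetical)."""

    key_col: int                                   # position of the indexed column
    rows: List[Row] = field(default_factory=list)  # append-only row storage
    index: Dict[Any, List[int]] = field(default_factory=dict)  # key -> row positions

    def append(self, new_rows: Iterable[Row]) -> int:
        """Append rows, update the index, and return the new version.

        The version is simply the row count after the append, so any
        earlier version identifies a prefix of the stored rows.
        """
        for row in new_rows:
            self.index.setdefault(row[self.key_col], []).append(len(self.rows))
            self.rows.append(row)
        return len(self.rows)

    def lookup(self, key: Any, version: Optional[int] = None) -> List[Row]:
        """Return rows matching `key` that are visible at `version`.

        The index avoids a full scan; rows appended after `version`
        are filtered out, which gives readers a stable snapshot.
        """
        limit = len(self.rows) if version is None else version
        return [self.rows[i] for i in self.index.get(key, []) if i < limit]

    def join(self, probe: Iterable[Row], probe_key_col: int,
             version: Optional[int] = None) -> List[Row]:
        """Index-based equi-join: probe the index once per probe-side row."""
        out: List[Row] = []
        for p in probe:
            for r in self.lookup(p[probe_key_col], version):
                out.append(r + p)
        return out
```

In this sketch, a reader that recorded the version returned by an earlier `append` keeps seeing exactly the rows that existed at that point, even after later appends, while a lookup or join without a version sees the latest data. Storage that is never modified in place is what makes such snapshots cheap; the multi-version concurrency control described in the abstract may work differently.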