The data science community has embraced dataframes as the de facto standard for data representation and manipulation. Ease of use, broad operator coverage, and the popularity of the R and Python languages have heavily influenced this shift. However, the most widely used serial dataframe implementations today (R, pandas) run into performance limitations even on moderately large data sets. We believe there is ample room for improvement by investigating the generic distributed patterns of dataframe operators. In this paper, we propose a framework that lays the foundation for building high-performance distributed-memory parallel dataframe systems based on these parallel processing patterns. We also present Cylon as a reference runtime implementation. We demonstrate how this framework has enabled Cylon to achieve scalable high performance. We also underline the flexibility of the proposed API and the extensibility of the framework to different hardware. To the best of our knowledge, Cylon is the first and only distributed-memory parallel dataframe system available today.