The data engineering and data science community has embraced the idea of using Python & R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these applications are now essential in order to process terabytes of data. They can easily exceed the capabilities of a single machine, but also demand significant developer time & effort. Therefore it is essential to design scalable dataframe solutions. There have been multiple attempts to tackle this problem, the most notable being the dataframe systems developed using distributed computing environments such as Dask and Ray. Even though Dask/Ray distributed computing features look very promising, we perceive that the Dask Dataframes/Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask/Ray infrastructure (thereby supercharging them!). To achieve this, we integrate a high performance dataframe system Cylon, which was originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that on a pipeline of dataframe operators, CylonFlow achieves 30x more distributed performance than Dask Dataframes. Interestingly, it also enables superior sequential performance due to the native C++ execution of Cylon. We believe the success of Cylon & CylonFlow extends beyond the data engineering domain, and can be used to consolidate high performance computing and distributed computing ecosystems.
翻译:数据工程和数据科学界已经接受了使用 Python 和 R 数据框架进行常规应用的理念。 在大数据革命和人工智能的驱动下,这些应用现在对于处理百万字节的数据至关重要。 它们可以很容易地超过一个机器的能力, 但也需要大量开发者的时间和努力。 因此, 设计可缩放的数据框架解决方案至关重要。 已经多次尝试解决这个问题, 最值得注意的是利用分布式计算环境( 如 Dask 和 Ray)开发的数据框架系统。 尽管Dask/ Ray 分布式计算功能看起来非常有希望, 我们发现 Dask 数据框架/ Ray 数据集仍然有优化的空间。 在本文中, 我们介绍CylonFlow, 一种分布式数据框架执行方法, 使Dask/ Ray 基础设施的状态和可扩展性能。 为了实现这一目标, 我们整合了一个高性能数据框架系统Cylon系统, 最初基于完全不同的执行模式, 进入Dask 和Ray。 我们的实验显示Dy-F 运行流程运行者能够更精确的Clon 的C级运行流程。