Exploratory Data Analysis (EDA) is a crucial step in any data science project. However, existing Python libraries fall short in helping data scientists complete common EDA tasks for statistical modeling. Their API design is either too low-level, optimized for plotting rather than EDA, or too high-level, making it hard to specify fine-grained EDA tasks. In response, we propose DataPrep.EDA, a novel task-centric EDA system in Python. DataPrep.EDA allows data scientists to declaratively specify a wide range of EDA tasks at different granularities with a single function call. We identify a number of challenges in implementing DataPrep.EDA and propose effective solutions to improve the scalability, usability, and customizability of the system. In particular, we discuss lessons learned from using Dask to build the data processing pipelines for EDA tasks and describe our approaches to accelerating these pipelines. We conduct extensive experiments to compare DataPrep.EDA with Pandas-profiling, the state-of-the-art EDA system in Python. The experiments show that DataPrep.EDA significantly outperforms Pandas-profiling in terms of both speed and user experience. DataPrep.EDA is open-sourced as an EDA component of DataPrep: https://github.com/sfu-db/dataprep.
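To make the idea of specifying EDA tasks at different granularities with a single function call more concrete, the following is a minimal sketch using only the Python standard library. It is not DataPrep.EDA's implementation or API: the function name `explore`, its helpers, and the dictionary-shaped results are hypothetical and chosen purely for illustration. The point is only that the arguments passed to one entry point determine which task runs.

```python
from collections import Counter
from statistics import mean


def _is_numeric(values):
    # Hypothetical helper: a column is numeric if every present value is int/float.
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)


def _present(values):
    # Drop missing entries, represented here as None.
    return [v for v in values if v is not None]


def explore(table, x=None, y=None):
    """Hypothetical task-centric entry point (not DataPrep.EDA's API).

    explore(table)        -> dataset overview
    explore(table, x)     -> univariate summary of column x
    explore(table, x, y)  -> bivariate summary of columns x and y
    """
    if x is None and y is not None:
        raise ValueError("specify x before y")

    if x is None:
        # Coarsest granularity: one entry per column.
        return {
            name: {
                "kind": "numeric" if _is_numeric(_present(col)) else "categorical",
                "missing": sum(v is None for v in col),
            }
            for name, col in table.items()
        }

    if y is None:
        # Finer granularity: the summary depends on the column's type.
        values = _present(table[x])
        if _is_numeric(values):
            return {"task": "univariate", "column": x,
                    "min": min(values), "max": max(values), "mean": mean(values)}
        return {"task": "univariate", "column": x,
                "top": Counter(values).most_common(3)}

    # Finest granularity in this sketch: rows where both columns are present.
    pairs = [(a, b) for a, b in zip(table[x], table[y])
             if a is not None and b is not None]
    return {"task": "bivariate", "columns": (x, y), "n_pairs": len(pairs)}


# A tiny column-oriented table with missing values (None).
table = {"age": [22, 35, None, 41], "city": ["A", "B", "A", None]}

overview = explore(table)
age_summary = explore(table, "age")
city_summary = explore(table, "city")
age_city = explore(table, "age", "city")
```

In this sketch the caller never chooses a chart type or a statistic directly; the choice follows from which columns are named and what type they hold, which is the declarative style the abstract describes.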