DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python - Zhuanzhi Papers


EDA · Statistics · Extensibility · Python · Dask

April 10, 2021

DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python


Jinglin Peng, Weiyuan Wu, Brandon Lockhart, Song Bian, Jing Nathan Yan, Linghao Xu, Zhixuan Chi, Jeffrey Rzeszotarski, Jiannan Wang

Exploratory Data Analysis (EDA) is a crucial step in any data science project. However, existing Python libraries fall short in helping data scientists complete common EDA tasks for statistical modeling. Their API design is either too low level, optimized for plotting rather than EDA, or too high level, making it hard to specify more fine-grained EDA tasks. In response, we propose DataPrep.EDA, a novel task-centric EDA system in Python. DataPrep.EDA allows data scientists to declaratively specify a wide range of EDA tasks at different granularities with a single function call. We identify a number of challenges in implementing DataPrep.EDA, and propose effective solutions to improve the scalability, usability, and customizability of the system. In particular, we discuss some lessons learned from using Dask to build the data processing pipelines for EDA tasks and describe our approaches to accelerating the pipelines. We conduct extensive experiments to compare DataPrep.EDA with Pandas-profiling, the state-of-the-art EDA system in Python. The experiments show that DataPrep.EDA significantly outperforms Pandas-profiling in terms of both speed and user experience. DataPrep.EDA is open-sourced as an EDA component of DataPrep: https://github.com/sfu-db/dataprep.
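To make the idea of specifying EDA tasks at different granularities through a single function call more concrete, here is a minimal sketch in plain Python. It is not DataPrep.EDA's implementation or API: the function name `explore`, the helper `is_numeric`, the dict-of-lists table format, and the returned summaries are all hypothetical and chosen only for illustration. The point it illustrates is that the arguments passed (no column, one column, or two columns) select the task, and the column types select the analysis.

```python
from collections import Counter, defaultdict
from math import sqrt
from statistics import mean, median


def is_numeric(values):
    """Treat a column as numerical if every non-missing value is an int or float."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in values
        if v is not None
    )


def explore(table, x=None, y=None):
    """Hypothetical task-centric entry point (illustration only).

    table: dict mapping column name -> list of values (None marks a missing value).
    The number of columns given decides the task; column types decide the analysis.
    """
    if x is None:
        # Coarsest granularity: an overview of every column.
        return {
            "task": "overview",
            "columns": {
                name: {
                    "type": "numerical" if is_numeric(col) else "categorical",
                    "missing": sum(v is None for v in col),
                }
                for name, col in table.items()
            },
        }

    if y is None:
        # One column: univariate summary, chosen by the column's type.
        xs = [v for v in table[x] if v is not None]
        if is_numeric(xs):
            return {
                "task": "univariate-numerical",
                "mean": mean(xs),
                "median": median(xs),
                "min": min(xs),
                "max": max(xs),
            }
        return {"task": "univariate-categorical", "counts": dict(Counter(xs))}

    # Two columns: keep only rows where both values are present.
    pairs = [
        (a, b) for a, b in zip(table[x], table[y]) if a is not None and b is not None
    ]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]

    if is_numeric(xs) and is_numeric(ys):
        # Numerical vs numerical: Pearson correlation, computed by hand.
        mx, my = mean(xs), mean(ys)
        cov = sum((a - mx) * (b - my) for a, b in pairs)
        denom = sqrt(sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys))
        return {"task": "numerical-numerical", "pearson": cov / denom if denom else None}

    if is_numeric(ys):
        # Categorical vs numerical: mean of y within each category of x.
        groups = defaultdict(list)
        for a, b in pairs:
            groups[a].append(b)
        return {
            "task": "categorical-numerical",
            "group_means": {k: mean(v) for k, v in groups.items()},
        }

    # Any other combination: co-occurrence counts of the value pairs.
    return {"task": "cross-counts", "counts": dict(Counter(pairs))}


# Made-up sample data for illustration.
table = {
    "age": [20, 30, None, 40],
    "city": ["A", "B", "A", None],
    "income": [10.0, 20.0, 30.0, 40.0],
}

print(explore(table))                    # overview of all columns
print(explore(table, "age"))             # one numerical column
print(explore(table, "city", "income"))  # categorical vs numerical
```

The same call shape covers all three levels of detail, so the user states what they want to understand rather than which plot or statistic to build. A real system would return visualizations and run on a scalable backend; this sketch only returns small dictionaries.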

