April 11, 2023

Mining the Characteristics of Jupyter Notebooks in Data Science Projects


Morakot Choetkiertikul, Apirak Hoonlor, Chaiyong Ragkhitwetsagul, Siripen Pongpaichet, Thanwadee Sunetnanta, Tasha Settewong, Vacharavich Jiravatvanich, Urisayar Kaewpichai

Nowadays, many industries have exceptionally high demand for data science skills such as data analysis, data mining, and machine learning. The computational notebook (e.g., Jupyter Notebook) is a well-known data science tool that is widely adopted in practice. Kaggle and GitHub are two platforms that data science communities use for knowledge sharing, skill practice, and collaboration. While tutorials and guidelines for novice data scientists are available on both platforms, only a small number of Jupyter Notebooks receive a high number of votes from the community. A high-voted notebook is considered to be well documented, easy to understand, and an application of the best data science and software engineering practices. In this research, we aim to understand the characteristics of high-voted Jupyter Notebooks on Kaggle and of popular Jupyter Notebooks for data science projects on GitHub. We plan to mine and analyse the Jupyter Notebooks on both platforms. We will perform exploratory analytics, data visualization, and feature importance analysis to understand the overall structure of these notebooks and to identify the common patterns and best-practice features that separate low-voted from high-voted notebooks. Upon completion of this research, the discovered insights can be applied as training guidelines for aspiring data scientists and machine learning practitioners looking to progress from a novice-ranked Jupyter Notebook on Kaggle to a deployable project on GitHub.
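As a rough illustration of what mining the structure of a notebook could involve, the sketch below reads a notebook's JSON and counts a few structural features. It is not the authors' method: the function name `notebook_features` and the chosen features (cell counts, line counts, markdown ratio) are assumptions made for this example. It relies only on the fact that an `.ipynb` file is JSON with a `cells` list whose entries carry a `cell_type` and a `source`.

```python
import json


def notebook_features(nb_json: str) -> dict:
    """Compute simple structural features from a notebook's JSON text.

    The feature set is hypothetical and chosen only for illustration.
    """
    nb = json.loads(nb_json)
    cells = nb.get("cells", [])

    def n_lines(cell: dict) -> int:
        src = cell.get("source", "")
        # The notebook format stores source as one string or a list of lines.
        text = src if isinstance(src, str) else "".join(src)
        return len(text.splitlines())

    code = [c for c in cells if c.get("cell_type") == "code"]
    markdown = [c for c in cells if c.get("cell_type") == "markdown"]

    return {
        "code_cells": len(code),
        "markdown_cells": len(markdown),
        "code_lines": sum(n_lines(c) for c in code),
        "markdown_lines": sum(n_lines(c) for c in markdown),
        # Share of cells that are documentation rather than code.
        "markdown_ratio": len(markdown) / len(cells) if cells else 0.0,
    }


# A tiny in-memory notebook, made up for this example.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Title\n", "Intro text"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None,
         "source": "import pandas as pd\ndf = pd.DataFrame()"},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["print(1)"]},
    ],
})

features = notebook_features(example)
print(features)
```

Features of this kind, computed per notebook and paired with vote counts, could then be fed into a model for feature importance analysis; which features actually matter is the question the study sets out to answer.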

