Many industries now have strong demand for data science skills such as data analysis, data mining, and machine learning. The computational notebook (e.g., Jupyter Notebook) is a data science tool widely adopted in practice. Kaggle and GitHub are two platforms that data science communities use for knowledge sharing, skill practice, and collaboration. While tutorials and guidelines for novice data scientists are available on both platforms, only a small number of Jupyter Notebooks receive a high number of votes from the community. A high-voted notebook is generally considered to be well documented, easy to understand, and to apply best practices in data science and software engineering. In this research, we aim to understand the characteristics of high-voted Jupyter Notebooks on Kaggle and of popular Jupyter Notebooks for data science projects on GitHub. We plan to mine and analyse the Jupyter Notebooks on both platforms. We will perform exploratory analytics, data visualization, and feature importance analysis to understand the overall structure of these notebooks and to identify common patterns and best-practice features that separate low-voted from high-voted notebooks. Upon completion of this research, the discovered insights can be applied as training guidelines for aspiring data scientists and machine learning practitioners looking to progress from a novice-ranked Jupyter Notebook on Kaggle to a deployable project on GitHub.
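To make the planned structural analysis concrete, the following is a minimal sketch of how simple structural features might be extracted from a single notebook file. It assumes the standard nbformat version 4 JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`). The function name `notebook_features` and the chosen features (cell counts, heading count, code line count, markdown ratio) are illustrative assumptions, not the feature set used in this research, and the heading count is a rough heuristic that also matches `#` lines inside fenced code in markdown cells.

```python
import json


def notebook_features(nb_json: str) -> dict:
    """Compute simple structural features from a notebook's JSON text.

    Hypothetical helper for illustration; the feature set is an assumption.
    """
    nb = json.loads(nb_json)
    cells = nb.get("cells", [])

    def source_lines(cell):
        # nbformat stores "source" as either a string or a list of strings.
        src = cell.get("source", "")
        text = "".join(src) if isinstance(src, list) else src
        return text.splitlines()

    code = [c for c in cells if c.get("cell_type") == "code"]
    markdown = [c for c in cells if c.get("cell_type") == "markdown"]

    # Rough heuristic: any markdown line starting with "#" counts as a heading.
    n_headings = sum(
        1
        for c in markdown
        for line in source_lines(c)
        if line.lstrip().startswith("#")
    )
    n_code_lines = sum(len(source_lines(c)) for c in code)

    return {
        "n_code_cells": len(code),
        "n_markdown_cells": len(markdown),
        "n_headings": n_headings,
        "n_code_lines": n_code_lines,
        "markdown_ratio": len(markdown) / len(cells) if cells else 0.0,
    }


# A tiny in-memory notebook (made-up content) standing in for a mined file.
sample = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Title\n", "Intro text"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None,
         "source": "import pandas as pd\ndf = pd.DataFrame()"},
    ],
})
features = notebook_features(sample)
print(features)
```

Feature dictionaries of this kind, computed per notebook and paired with vote counts, could then be tabulated for exploratory analysis or passed to a model that reports feature importance.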