An increasingly large number of software systems today include data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages, from acquisition, to cleaning/curation, to modeling, and so on, is referred to as a data science pipeline. To facilitate research and practice on data science pipelines, it is essential to understand their nature. What are the typical stages of a data science pipeline? How are they connected? Do pipelines differ between their theoretical representations and those found in practice? Today we do not fully understand these architectural characteristics of data science pipelines. In this work, we present a three-pronged comprehensive study to answer these questions for the state of the art, data science in-the-small, and data science in-the-large. Our study analyzes three datasets: a collection of 71 proposals for data science pipelines and related concepts in theory, a collection of over 105 implementations of curated data science pipelines from Kaggle competitions to understand data science in-the-small, and a collection of 21 mature data science projects from GitHub to understand data science in-the-large. Our study has led to three representations of data science pipelines that capture the essence of our subjects in theory, in-the-small, and in-the-large.