While the open-source model for software development has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small groups. We describe challenges to scaling data science collaborations and present a novel conceptual framework and ML programming model to address them. We instantiate these ideas in Ballet, a lightweight software framework for collaborative open-source data science and a cloud-based development environment, with a plugin for collaborative feature engineering. Using our framework, collaborators incrementally propose feature definitions to a repository; each proposed feature is subjected to an ML evaluation and, if accepted, can be automatically merged into an executable feature engineering pipeline. We use Ballet to conduct an extensive case study of a real-world income prediction problem and discuss the implications for collaborative projects.
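The propose, evaluate, and merge loop described in the abstract can be illustrated with a minimal sketch. This is not Ballet's API: `Feature`, `FeaturePipeline`, `propose`, and the two thresholds are hypothetical names chosen for illustration. A simple Pearson-correlation check stands in for the ML evaluation, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


@dataclass
class Feature:
    """Hypothetical feature definition: maps one raw row to one value."""
    name: str
    transform: Callable[[Row], float]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


@dataclass
class FeaturePipeline:
    """Hypothetical repository of accepted features (not Ballet's own class)."""
    rows: List[Row]
    target: List[float]
    min_relevance: float = 0.3    # assumed threshold, for illustration only
    max_redundancy: float = 0.95  # assumed threshold, for illustration only
    accepted: List[Feature] = field(default_factory=list)

    def _column(self, feature: Feature) -> List[float]:
        return [feature.transform(row) for row in self.rows]

    def propose(self, feature: Feature) -> bool:
        """Evaluate a proposed feature; merge it only if it passes both checks."""
        values = self._column(feature)
        # Stand-in evaluation 1: the feature must correlate with the target.
        if abs(pearson(values, self.target)) < self.min_relevance:
            return False
        # Stand-in evaluation 2: it must not duplicate an accepted feature.
        for existing in self.accepted:
            if abs(pearson(values, self._column(existing))) > self.max_redundancy:
                return False
        self.accepted.append(feature)
        return True

    def transform(self, row: Row) -> List[float]:
        """Apply every accepted feature, in merge order, to one row."""
        return [feature.transform(row) for feature in self.accepted]


# Toy data, invented for this sketch.
rows = [
    {"age": 25, "hours": 20},
    {"age": 35, "hours": 40},
    {"age": 45, "hours": 30},
    {"age": 55, "hours": 50},
]
target = [20.0, 45.0, 40.0, 70.0]

pipeline = FeaturePipeline(rows, target)
results = [
    pipeline.propose(Feature("hours", lambda r: r["hours"])),
    pipeline.propose(Feature("hours_x2", lambda r: 2 * r["hours"])),  # redundant
    pipeline.propose(Feature("constant", lambda r: 1.0)),             # irrelevant
    pipeline.propose(Feature("age", lambda r: r["age"])),
]
accepted_names = [feature.name for feature in pipeline.accepted]
```

In this sketch each proposal is judged on its own against the current state of the pipeline, so contributors can work independently and the pipeline stays executable after every merge. A real evaluation would more plausibly measure the change in a model's validation performance than a raw correlation.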