While the open-source software development model has enabled successful large-scale collaborations on software systems, data science projects are frequently developed by individuals or small teams. We describe the challenges of scaling data science collaborations and present a conceptual framework and ML programming model to address them. We instantiate these ideas in Ballet, a lightweight framework for collaborative, open-source data science that focuses on feature engineering, together with an accompanying cloud-based development environment. Using our framework, collaborators incrementally propose feature definitions to a repository; each proposed feature is subjected to an ML performance evaluation and can then be automatically merged into an executable feature engineering pipeline. We use Ballet to conduct a case study of an income prediction problem with 27 collaborators and discuss implications for future designers of collaborative projects.
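The propose, evaluate, and merge workflow summarized in the abstract can be illustrated with a minimal sketch. It is not Ballet's actual API: the names `Feature`, `FeaturePipeline`, `propose`, and `abs_correlation` are hypothetical, the toy data is invented for illustration, and absolute Pearson correlation with the target is used only as a simple stand-in for a real ML performance evaluation.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass
class Feature:
    """A proposed feature definition (hypothetical type, for illustration)."""
    name: str
    transform: Callable[[Row], float]


def abs_correlation(xs: List[float], ys: List[float]) -> float:
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


@dataclass
class FeaturePipeline:
    """Holds accepted features and applies them all to raw rows."""
    threshold: float = 0.3  # assumed acceptance cutoff, chosen arbitrarily
    accepted: List[Feature] = field(default_factory=list)

    def propose(self, feature: Feature, rows: List[Row], target: List[float]) -> bool:
        """Evaluate a proposed feature and merge it only if it passes."""
        values = [feature.transform(r) for r in rows]
        if abs_correlation(values, target) < self.threshold:
            return False  # rejected: too weakly related to the target
        self.accepted.append(feature)  # "automatic merge" into the pipeline
        return True

    def transform(self, rows: List[Row]) -> List[List[float]]:
        """Build the feature matrix from every accepted feature."""
        return [[f.transform(r) for f in self.accepted] for r in rows]


# Toy income-prediction data (invented values).
rows = [
    {"age": 25, "hours": 20, "id": 3},
    {"age": 35, "hours": 40, "id": 1},
    {"age": 45, "hours": 40, "id": 4},
    {"age": 55, "hours": 60, "id": 2},
]
income = [20.0, 40.0, 50.0, 70.0]

pipeline = FeaturePipeline()
hours_ok = pipeline.propose(Feature("hours", lambda r: r["hours"]), rows, income)
id_ok = pipeline.propose(Feature("id", lambda r: r["id"]), rows, income)
matrix = pipeline.transform(rows)
```

In this sketch each proposal is judged independently against the target, so the pipeline grows one feature at a time. A fuller evaluation would also consider whether a new feature adds information beyond those already accepted.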