Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles

Machine learning (ML) is an increasingly important scientific tool that supports decision making and knowledge generation in numerous fields. As its use grows, it becomes ever more important that the results of ML experiments are reproducible. Unfortunately, this is often not the case: like many other disciplines, ML faces a reproducibility crisis. In this paper, we describe our goals and initial steps toward supporting the end-to-end reproducibility of ML pipelines. We investigate which factors, beyond the availability of source code and datasets, influence the reproducibility of ML experiments. We propose ways to apply FAIR data practices to ML workflows. We present preliminary results on the role of our tool, ProvBook, in capturing and comparing the provenance of ML experiments and their reproducibility using Jupyter Notebooks.
