Reproducibility is a crucial aspect of scientific research: it is the ability to independently replicate experimental results by analysing the same data or repeating the same experiment. Over the years, many approaches have been proposed to make experimental results reproducible. However, very few address data reproducibility, defined as the ability of independent researchers to retrieve the same dataset used as input for experimentation. Addressing data reproducibility properly is crucial because providing a link to the data is often not enough to make the results reproducible; proper metadata (e.g., preprocessing instructions) must also be provided to make a dataset fully reproducible. In this work, we aim to fill this gap by proposing a decision tree that shepherds researchers through making their datasets reproducible. In particular, the decision tree guides researchers in determining whether a dataset is actually reproducible and whether additional metadata (i.e., additional resources needed to reproduce the data) must also be provided. This decision tree will be the foundation of a future application that automates the data reproduction process by supplying the necessary metadata based on the particular context (e.g., data availability and data preprocessing). Note that this paper details the steps to make a dataset retrievable; other crucial aspects of reproducibility (e.g., dataset documentation) are left to future work.