Background: Many published machine learning studies are irreproducible. Issues with methodology and a failure to properly account for variation introduced by the algorithms themselves, or by their implementations, are reported as the main contributors to this irreproducibility.

Problem: No theoretical framework exists that relates experiment design choices to their potential effects on conclusions. Without such a framework, it is much harder for practitioners and researchers to evaluate experimental results and describe the limitations of experiments. The lack of such a framework also makes it harder for independent researchers to systematically attribute the causes of failed reproducibility experiments.

Objective: The objective of this paper is to develop a framework that enables applied data science practitioners and researchers to understand which experiment design choices can lead to false findings, and how, and thereby to aid in analyzing the conclusions of reproducibility experiments.

Method: We have compiled an extensive list of factors reported in the literature that can lead to machine learning studies being irreproducible. These factors are organized and categorized in a reproducibility framework motivated by the stages of the scientific method. The factors are analyzed for how they can affect the conclusions drawn from experiments. A model comparison study is used as an example.

Conclusion: We provide a framework that describes machine learning methodology from experimental design decisions to the conclusions inferred from them.