Reliability of machine learning evaluation -- the consistency of observed evaluation scores across replicated model training runs -- is affected by several sources of nondeterminism that can be regarded as measurement noise. Current tendencies to remove noise in order to enforce reproducibility of research results neglect inherent nondeterminism at the implementation level and disregard crucial interaction effects between algorithmic noise factors and data properties. This limits the scope of conclusions that can be drawn from such experiments. Instead of removing noise, we propose to incorporate several sources of variance, including their interaction with data properties, into an analysis of significance and reliability of machine learning evaluation, with the aim of drawing inferences beyond particular instances of trained models. We show how to use linear mixed effects models (LMEMs) to analyze performance evaluation scores, and how to conduct statistical inference with a generalized likelihood ratio test (GLRT). This allows us to incorporate arbitrary sources of noise, such as meta-parameter variations, into statistical significance testing, and to assess performance differences conditional on data properties. Furthermore, a variance component analysis (VCA) enables the analysis of the contribution of noise sources to overall variance and the computation of a reliability coefficient as the ratio of substantial to total variance.
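As an illustration of the proposed analysis, the following Python snippet is a minimal sketch (not the paper's own implementation) that fits an LMEM to replicated evaluation scores with statsmodels, performs a GLRT for a system difference conditional on a data property, and computes variance components together with a reliability coefficient. The data frame, file name, and column names ("score", "system", "item", "seed", "length") are hypothetical stand-ins for an actual evaluation setup.

```python
# Minimal sketch under assumed data: per-item evaluation scores in long format,
# replicated across training runs (seeds) for two systems, with a data property
# "length" recorded per item. All names below are illustrative.
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

scores = pd.read_csv("evaluation_scores.csv")  # hypothetical file

# LMEM: fixed effects for system and its interaction with a data property;
# random intercepts per test item group the replicated measurements, while
# replications across training seeds contribute to the residual (noise)
# variance. Fit by ML (reml=False) so that models differing in their fixed
# effects can be compared by a likelihood ratio test.
full = smf.mixedlm("score ~ system * length", scores,
                   groups=scores["item"]).fit(reml=False)
null = smf.mixedlm("score ~ length", scores,
                   groups=scores["item"]).fit(reml=False)

# Generalized likelihood ratio test for the system effect:
# 2 * (difference in log-likelihoods) is asymptotically chi-squared with
# degrees of freedom equal to the number of additional fixed-effect parameters.
lr = 2.0 * (full.llf - null.llf)
df = len(full.fe_params) - len(null.fe_params)
p_value = stats.chi2.sf(lr, df)
print(f"GLRT: LR = {lr:.3f}, df = {df}, p = {p_value:.4f}")

# Variance component analysis: between-item (substantial) variance versus
# residual (noise) variance, and a reliability coefficient as the ratio of
# substantial to total variance.
var_item = float(full.cov_re.iloc[0, 0])
var_resid = float(full.scale)
reliability = var_item / (var_item + var_resid)
print(f"item variance = {var_item:.4f}, residual variance = {var_resid:.4f}, "
      f"reliability = {reliability:.3f}")
```

In this simplified setup the reliability coefficient is an intraclass-correlation-style ratio; richer designs (e.g., crossed random effects for seeds and meta-parameter settings) would add further variance components to the denominator.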