Post-Selection Confidence Bounds for Prediction Performance

From arXiv: 17 pages, 13 figures, 3 tables. Submitted to the Springer Machine Learning Journal. Changes to version 2: made figures easier to read; corrected a minor typo.

In machine learning, selecting a promising model from a potentially large number of competing models and assessing its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated: the sample at hand is split into a training, a validation, and an evaluation set, and only a single confidence interval is computed for the prediction performance of the final selected model. We instead propose an algorithm for computing valid lower confidence bounds for multiple models that have been selected based on their prediction performance in the evaluation set, by interpreting the selection problem as a simultaneous inference problem. We use bootstrap tilting and a maxT-type multiplicity correction. The approach is universally applicable to any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights. We conducted various simulation experiments, which show that our proposed approach yields lower confidence bounds that are at least comparable to bounds from standard approaches and that reliably reach the nominal coverage probability. In addition, especially when the sample size is small, our proposed approach yields better-performing prediction models than the default of selecting only one model for evaluation.
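To make the idea of a maxT-type multiplicity correction more concrete, the following is a minimal sketch, not the authors' algorithm. It swaps bootstrap tilting for a plain nonparametric bootstrap written as random resampling weights, and it uses accuracy as the weight-accepting performance measure. The function names (`weighted_accuracy`, `maxt_lower_bounds`), the 0/1 correctness matrix, and all parameter defaults are assumptions made for illustration.

```python
import numpy as np


def weighted_accuracy(correct, weights):
    """Weighted share of correct predictions per model.

    correct: (n, m) array of 0/1 indicators (observation i, model j).
    weights: (n,) or (B, n) array of observation weights summing to 1 per row.
    """
    return weights @ correct


def maxt_lower_bounds(correct, alpha=0.05, n_boot=2000, seed=0):
    """Simultaneous lower confidence bounds via a maxT-type bootstrap.

    Illustrative assumption: plain bootstrap weights, not bootstrap tilting.
    """
    correct = np.asarray(correct, dtype=float)
    n, _ = correct.shape
    rng = np.random.default_rng(seed)

    uniform = np.full(n, 1.0 / n)
    estimate = weighted_accuracy(correct, uniform)        # shape (m,)
    se = correct.std(axis=0, ddof=1) / np.sqrt(n)
    se = np.maximum(se, 1e-12)                            # guard against zero variance

    # Resampling n observations with replacement is the same as drawing
    # multinomial counts and using counts / n as observation weights.
    counts = rng.multinomial(n, uniform, size=n_boot)     # shape (n_boot, n)
    boot = weighted_accuracy(correct, counts / n)         # shape (n_boot, m)

    # maxT step: take the largest standardized deviation across all models
    # in each bootstrap replicate, then use its (1 - alpha) quantile as a
    # common critical value for every model.
    max_stat = ((boot - estimate) / se).max(axis=1)
    crit = np.quantile(max_stat, 1.0 - alpha)

    lower = estimate - crit * se
    return estimate, lower, crit


# Usage with simulated correctness indicators for three hypothetical models
rng = np.random.default_rng(1)
correct = rng.binomial(1, [0.70, 0.80, 0.90], size=(200, 3))
estimate, lower, crit = maxt_lower_bounds(correct)
```

Because a single critical value is taken from the distribution of the maximum statistic, the bounds in this sketch are meant to hold for all candidate models at once, which is what permits choosing among them after looking at the evaluation set.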
