Estimating the expected predictive performance of a model is useful whenever the model is intended for prediction. We focus on leave-one-out cross-validation (LOO-CV), which has become a popular method for estimating the predictive performance of Bayesian models. Given two models, we are interested in comparing their predictive performance and the uncertainty associated with the comparison, which can also be used to compute the probability that one model has better predictive performance than the other. We study the properties of the Bayesian LOO-CV estimator and the related uncertainty quantification for the difference in predictive performance, and analyse when a normal approximation of this uncertainty is well calibrated and whether taking higher moments into account could improve the approximation. We provide new results on these properties, both theoretically for linear regression and empirically for hierarchical linear, latent linear, and spline models, and discuss the challenges. We show that problematic cases include comparing models with similar predictions, misspecified models, and small data sets. In these cases, the connection between the distribution of the LOO-CV estimator and the distribution of its error is weak. We show that, in certain situations, the problematic skewness of the error distribution for the difference, which occurs when the models make similar predictions, does not fade away as the data size grows to infinity. Based on these results, we also provide practical recommendations for users of Bayesian LOO-CV who compare the predictive performance of models.
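The comparison described in the abstract can be illustrated with a minimal sketch of the normal approximation for the LOO-CV difference. It assumes that pointwise LOO log predictive densities are already available for both models on the same observations. The function name `compare_loo`, its return keys, and the input numbers are hypothetical and chosen only for illustration. The standard error uses the common estimate of sqrt(n) times the sample standard deviation of the pointwise differences, and the sample skewness is reported because the abstract identifies skewness as a source of poor calibration.

```python
import math
from statistics import NormalDist, mean, stdev


def compare_loo(lpd_a, lpd_b):
    """Summarise the LOO-CV difference (model A minus model B).

    lpd_a, lpd_b: pointwise LOO log predictive densities, one value per
    observation, for the same observations in the same order.
    """
    if len(lpd_a) != len(lpd_b):
        raise ValueError("both models need the same observations")
    n = len(lpd_a)
    if n < 2:
        raise ValueError("at least two observations are required")

    diffs = [a - b for a, b in zip(lpd_a, lpd_b)]
    elpd_diff = sum(diffs)

    # Standard error of the summed difference under the usual estimate.
    se = math.sqrt(n) * stdev(diffs)
    if se == 0.0:
        raise ValueError("pointwise differences are all identical")

    # Sample skewness of the pointwise differences (biased moment estimator).
    m = mean(diffs)
    m2 = sum((d - m) ** 2 for d in diffs) / n
    m3 = sum((d - m) ** 3 for d in diffs) / n
    skewness = m3 / m2 ** 1.5

    # Normal-approximation probability that model A predicts better than B.
    prob_a_better = NormalDist().cdf(elpd_diff / se)

    return {
        "elpd_diff": elpd_diff,
        "se": se,
        "skewness": skewness,
        "prob_a_better": prob_a_better,
    }


# Made-up pointwise values, for illustration only.
lpd_a = [-1.0, -1.2, -0.8, -1.1]
lpd_b = [-1.5, -1.1, -1.3, -1.4]
summary = compare_loo(lpd_a, lpd_b)
print(summary)
```

The probability returned here is only as trustworthy as the normal approximation itself. According to the abstract, that approximation can be poorly calibrated when the models make similar predictions, when the models are misspecified, or when the data set is small, so a sketch like this is a starting point rather than a decision rule.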