In supervised learning, estimating the prediction error on unlabeled test data is an important task. Existing methods are usually built on the assumption that the training and test data are sampled from the same distribution, an assumption that is often violated in practice. As a result, traditional estimators such as cross-validation (CV) become biased, which may lead to poor model selection. In this paper, we assume access to a test dataset in which the feature values are available but the outcome labels are not, and we focus on a particular form of distribution shift known as "covariate shift". We propose an alternative method based on a parametric bootstrap of the conditional error target. Empirically, our method outperforms CV on both simulated and real data across a range of modeling tasks.
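To make the setting concrete, the following is a minimal illustrative sketch (not the authors' exact procedure) of a parametric bootstrap estimate of conditional prediction error under covariate shift: a linear model is fit on training data, and the fitted model is treated as the truth to resimulate training outcomes, refit, and score predictions at the observed but unlabeled test covariates. The Gaussian linear model, the shift N(0,1) → N(1,1), and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Setup: y = X @ beta + noise; training covariates are N(0, 1),
# test covariates are N(1, 1) -- a simple instance of covariate shift.
n, m, p = 200, 100, 3
beta = np.array([1.0, -2.0, 0.5])
X_tr = rng.normal(0.0, 1.0, (n, p))
y_tr = X_tr @ beta + rng.normal(size=n)
X_te = rng.normal(1.0, 1.0, (m, p))   # test labels are assumed unavailable

# Fit OLS on the training data and estimate the noise variance.
bhat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
sigma2 = np.sum((y_tr - X_tr @ bhat) ** 2) / (n - p)

# Parametric bootstrap of the conditional error: treating the fitted model
# as the truth, resimulate training outcomes, refit, and score the refit
# against simulated outcomes at the *observed* test covariates X_te.
B = 500
errs = np.empty(B)
for b in range(B):
    y_tr_star = X_tr @ bhat + rng.normal(0.0, np.sqrt(sigma2), n)
    bhat_star, *_ = np.linalg.lstsq(X_tr, y_tr_star, rcond=None)
    y_te_star = X_te @ bhat + rng.normal(0.0, np.sqrt(sigma2), m)
    errs[b] = np.mean((y_te_star - X_te @ bhat_star) ** 2)

err_est = errs.mean()   # estimated conditional MSE on the shifted test set
print(round(err_est, 3))
```

Note that, unlike CV, this estimate targets the error at the actual test covariates, so it reflects the shifted distribution directly.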