Datasets are rarely a realistic approximation of the target population: prevalence may be misrepresented, image quality may exceed clinical standards, and so on. This mismatch is known as sampling bias. Sampling biases are a major hindrance to machine learning models: they cause significant gaps between model performance in the lab and in the real world. Our work addresses prevalence bias, the discrepancy between the prevalence of a pathology and its sampling rate in the training dataset, introduced when collecting data or when the practitioner rebalances the training batches. This paper lays the theoretical and computational framework for training models, and for prediction, in the presence of prevalence bias. Concretely, a bias-corrected loss function and bias-corrected predictive rules are derived under the principles of Bayesian risk minimization. The loss exhibits a direct connection to the information gain. It offers a principled alternative to heuristic training losses and complements test-time procedures based on selecting an operating point from summary curves. It integrates seamlessly into the current paradigm of (deep) learning using stochastic backpropagation, and naturally with Bayesian models.
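The exact loss is derived in the body of the paper; as a rough illustration of the general idea only, here is a minimal sketch of a prior-ratio ("logit adjustment") correction for prevalence shift in PyTorch. The function name, its arguments (`train_rates`, `true_prevalence`), and the specific form of the correction are assumptions for illustration, not the paper's Bayesian-risk-derived loss.

```python
import torch
import torch.nn.functional as F

def prevalence_corrected_loss(logits, targets, train_rates, true_prevalence):
    """Cross-entropy with a prior-ratio correction on the logits (a sketch).

    Shifting each class logit by log(p_deploy / p_train) re-weights the
    posterior from the (rebalanced) training prevalence toward the
    deployment prevalence -- a standard label-shift correction, used here
    only to illustrate the kind of bias correction the paper formalizes.
    """
    log_ratio = torch.log(true_prevalence) - torch.log(train_rates)
    adjusted = logits + log_ratio  # broadcasts over the batch dimension
    return F.cross_entropy(adjusted, targets)

# Hypothetical usage: batches rebalanced to 50/50, true prevalence 5%.
logits = torch.randn(8, 2)                    # model outputs for a batch
targets = torch.randint(0, 2, (8,))           # ground-truth labels
train_rates = torch.tensor([0.5, 0.5])        # sampling rates in training
true_prevalence = torch.tensor([0.95, 0.05])  # deployment prevalence
loss = prevalence_corrected_loss(logits, targets, train_rates, true_prevalence)
```

The same log-prior shift can be applied at test time before taking the argmax, which corresponds to the bias-corrected predictive rules mentioned above (again, as a generic stand-in for the rules derived in the paper).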