Understanding how different classes are distributed in an unlabeled data set is an important challenge for the calibration of probabilistic classifiers and uncertainty quantification. Approaches such as adjusted classify and count, black-box shift estimators, and invariant ratio estimators use an auxiliary (and potentially biased) black-box classifier trained on a different (shifted) data set to estimate the class distribution, and they yield asymptotic guarantees under weak assumptions. We demonstrate that all these algorithms are closely related to inference in a particular Bayesian model that approximates the assumed ground-truth generative process. We then discuss an efficient Markov chain Monte Carlo sampling scheme for the introduced model and prove an asymptotic consistency guarantee in the large-data limit. We compare the introduced model against the established point estimators in a variety of scenarios and show that it is competitive with, and in some cases superior to, the state of the art.
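To make the setting concrete, the point estimators mentioned above share a common recipe: under label shift, the distribution of the black-box classifier's predictions on the unlabeled data is a mixture of its class-conditional prediction distributions, so the unknown class prevalences can be recovered by solving a linear system built from a confusion matrix estimated on labeled (source) data. The following is a minimal NumPy sketch of a black-box-shift-style estimator in this spirit; the function name `bbse_prevalence` and its arguments are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def bbse_prevalence(y_true_src, y_pred_src, y_pred_tgt, n_classes):
    """Sketch of a black-box shift / adjusted-count style prevalence estimator.

    y_true_src : true labels on the labeled (source) data
    y_pred_src : black-box classifier predictions on the same source data
    y_pred_tgt : black-box classifier predictions on the unlabeled (target) data
    """
    # Joint confusion matrix on source data: C[i, j] ~ P_s(f(x)=i, y=j)
    C = np.zeros((n_classes, n_classes))
    for true, pred in zip(y_true_src, y_pred_src):
        C[pred, true] += 1.0
    C /= len(y_true_src)

    # Distribution of predicted labels on the unlabeled target data: mu_i ~ P_t(f(x)=i)
    mu = np.bincount(y_pred_tgt, minlength=n_classes) / len(y_pred_tgt)

    # Under label shift, mu = C @ w with w_j = P_t(y=j) / P_s(y=j);
    # solve the linear system, then convert the ratios back to prevalences.
    w = np.linalg.solve(C, mu)
    p_src = C.sum(axis=0)                 # source prevalence P_s(y=j)
    p_tgt = np.clip(w * p_src, 0.0, None) # guard against negative estimates
    return p_tgt / p_tgt.sum()            # project onto the probability simplex
```

Such point estimates are consistent in the large-data limit but come without finite-sample uncertainty; the Bayesian model discussed in the paper targets exactly that gap by placing the same generative assumptions inside a posterior that can be sampled with MCMC.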