Selective classifiers improve model reliability by abstaining on inputs the model deems uncertain. However, few practical approaches achieve the gold-standard performance of a perfect-ordering oracle that accepts examples exactly in order of correctness. Our work formalizes this shortfall as the selective-classification gap and presents the first finite-sample decomposition of this gap into five distinct sources of looseness: Bayes noise, approximation error, ranking error, statistical noise, and implementation- or shift-induced slack. Crucially, our analysis reveals that monotone post-hoc calibration -- often believed to strengthen selective classifiers -- has limited impact on closing this gap, since it rarely alters the model's underlying score ranking. Bridging the gap therefore requires scoring mechanisms that can effectively reorder predictions rather than merely rescale them. We validate our decomposition on synthetic two-moons data and on real-world vision and language benchmarks, isolating each error component through controlled experiments. Our results confirm that (i) Bayes noise and limited model capacity can account for substantial gaps, (ii) only richer, feature-aware calibrators meaningfully improve score ordering, and (iii) data shift introduces a separate slack that demands distributionally robust training. Together, our decomposition yields a quantitative error budget as well as actionable design guidelines that practitioners can use to build selective classifiers that more closely approximate ideal oracle behavior.
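The gap to the perfect-ordering oracle, and the claim that monotone recalibration cannot shrink it, can be sketched numerically. This is a minimal illustration on synthetic data, not the paper's implementation; the function names (`risk_coverage_curve`, `selective_gap`) and the noisy-confidence setup are assumptions introduced here for exposition.

```python
# Illustrative sketch (not the paper's code): measure the selective-classification
# gap as the mean excess selective risk over the perfect-ordering oracle.
import numpy as np

def risk_coverage_curve(correct, scores):
    """Selective risk at each coverage level when examples are accepted
    in decreasing order of `scores` (most confident first)."""
    order = np.argsort(-scores)
    errors = 1.0 - correct[order]            # 1 where the prediction is wrong
    n = len(errors)
    coverage = np.arange(1, n + 1) / n
    risk = np.cumsum(errors) / np.arange(1, n + 1)
    return coverage, risk

def selective_gap(correct, scores):
    """Mean excess risk over an oracle that accepts all correct examples first.
    The oracle risk lower-bounds the model's risk at every coverage level,
    so the returned gap is nonnegative; 0 means oracle-perfect ranking."""
    _, model_risk = risk_coverage_curve(correct, scores)
    _, oracle_risk = risk_coverage_curve(correct, correct)  # correctness as score
    return float(np.mean(model_risk - oracle_risk))

rng = np.random.default_rng(0)
correct = (rng.random(1000) < 0.8).astype(float)     # ~80% of predictions correct
noisy_conf = correct + 0.5 * rng.normal(size=1000)   # imperfectly ranked confidences

gap = selective_gap(correct, noisy_conf)
# A strictly monotone recalibration (here tanh) preserves the score ordering,
# so it leaves the gap unchanged -- mirroring the abstract's claim that
# monotone post-hoc calibration cannot close the ranking component of the gap.
gap_after_calibration = selective_gap(correct, np.tanh(noisy_conf))
print(gap, gap_after_calibration)
```

The two printed gaps are identical because `tanh` never reorders scores; only a feature-aware scorer that actually reorders predictions could reduce the first value toward zero.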