We study the statistical properties of learning to defer (L2D) to multiple experts. In particular, we address the open problems of deriving a consistent surrogate loss, confidence calibration, and principled ensembling of experts. First, we derive two consistent surrogates -- one based on a softmax parameterization, the other on a one-vs-all (OvA) parameterization -- that are analogous to the single-expert losses proposed by Mozannar and Sontag (2020) and Verma and Nalisnick (2022), respectively. We then study the frameworks' ability to estimate P(m_j = y | x), the probability that the j-th expert will correctly predict the label for x. Theory shows that the softmax-based loss causes mis-calibration to propagate between the estimates while the OvA-based loss does not (though in practice, we find there are trade-offs). Lastly, we propose a conformal inference technique that chooses a subset of experts to query when the system defers. We perform empirical validation on tasks for galaxy, skin lesion, and hate speech classification.