For users to trust model predictions, they need to understand model outputs, particularly their confidence: calibration aims to adjust (calibrate) models' confidence to match expected accuracy. We argue that traditional calibration evaluation does not promote effective calibration: for example, it can reward assigning a mediocre confidence score to all predictions, which does not help users distinguish correct predictions from wrong ones. Building on these observations, we propose a new calibration metric, MacroCE, that better captures whether the model assigns low confidence to wrong predictions and high confidence to correct ones. Focusing on the practical application of open-domain question answering, we examine conventional calibration methods applied to the widely used retriever-reader pipeline, none of which brings significant gains under our new MacroCE metric. Toward better calibration, we propose a new calibration method (ConsCal) that considers not just final model predictions but whether multiple model checkpoints make consistent predictions. Altogether, we provide an alternative view of calibration along with a new metric, a re-evaluation of existing calibration methods under that metric, and a more effective calibration method.
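To make the intuition behind MacroCE concrete, here is a minimal illustrative sketch (not the paper's reference implementation): it macro-averages instance-level calibration error over the correct and wrong predictions separately, so a model that assigns a uniformly mediocre confidence to everything cannot score well. The function name `macro_ce` and the exact aggregation are assumptions made for illustration.

```python
import numpy as np

def macro_ce(confidences, correctness):
    """Illustrative macro-averaged calibration error.

    Instance-level error is (1 - confidence) for correct predictions
    and confidence itself for wrong predictions; the two per-group
    means are averaged so that wrong predictions are not drowned out
    when the model is mostly correct.
    """
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correctness, dtype=bool)

    # Error on correct predictions: how far confidence falls short of 1.
    ice_pos = np.mean(1.0 - conf[corr]) if corr.any() else 0.0
    # Error on wrong predictions: how far confidence exceeds 0.
    ice_neg = np.mean(conf[~corr]) if (~corr).any() else 0.0

    return 0.5 * (ice_pos + ice_neg)


# A model that gives every prediction a mediocre 0.5 confidence scores
# poorly (0.5), whereas confident-when-right / unconfident-when-wrong
# behavior scores close to 0.
print(macro_ce([0.5, 0.5, 0.5, 0.5], [True, True, True, False]))   # 0.5
print(macro_ce([0.9, 0.95, 0.85, 0.1], [True, True, True, False])) # ~0.1
```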