Calibration is a popular framework to evaluate whether a classifier knows when it does not know - i.e., its predictive probabilities are a good indication of how likely a prediction is to be correct. Correctness is commonly estimated against the human majority class. Recently, calibration to human majority has been measured on tasks where humans inherently disagree about which class applies. We show that measuring calibration to human majority given inherent disagreements is theoretically problematic, demonstrate this empirically on the ChaosNLI dataset, and derive several instance-level measures of calibration that capture key statistical properties of human judgements - class frequency, ranking and entropy.
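To make the contrast concrete, below is a minimal Python sketch of generic instance-level statistics one could compute against the full distribution of human judgements rather than the majority class alone - distance in class frequency, rank correlation, and entropy gap. The function name, the particular statistics chosen (total variation distance, Spearman correlation), and the example annotation counts are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np
from scipy.stats import entropy, spearmanr

def instance_level_measures(human_counts, model_probs):
    """Illustrative per-instance comparisons between a model's predictive
    distribution and the distribution of human judgements for one item.
    These are generic statistics covering class frequency, ranking, and
    entropy, not necessarily the measures derived in the paper."""
    human_probs = np.asarray(human_counts, dtype=float)
    human_probs /= human_probs.sum()
    model_probs = np.asarray(model_probs, dtype=float)

    # Class frequency: total variation distance between the two distributions.
    tvd = 0.5 * np.abs(human_probs - model_probs).sum()

    # Ranking: rank correlation between the class orderings.
    rank_corr, _ = spearmanr(human_probs, model_probs)

    # Entropy: gap between human-judgement entropy and predictive entropy.
    entropy_gap = abs(entropy(human_probs) - entropy(model_probs))

    return {"tvd": tvd, "rank_corr": rank_corr, "entropy_gap": entropy_gap}

# Example: a 3-way NLI item (entailment, neutral, contradiction) with
# 100 human annotations, in the spirit of ChaosNLI.
print(instance_level_measures([60, 30, 10], [0.8, 0.15, 0.05]))
```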