Within the last few years, there has been a move toward using statistical models in conjunction with neural networks, with the goal of better answering the question, "what do our models know?". From this trend, classical metrics such as Prediction Interval Coverage Probability (PICP) and newer metrics such as calibration error have entered the general repertoire of model evaluation, offering insight into how the uncertainty of our models compares to reality. One important component of uncertainty modeling is model uncertainty (epistemic uncertainty), a measure of what the model does and does not know. However, current evaluation techniques tend to conflate model uncertainty with aleatoric uncertainty (irreducible error), leading to incorrect conclusions. In this paper, using posterior predictive checks, we show that calibration error and its variants are almost always incorrect to use in the presence of model uncertainty, and we further show how this mistake can lead to trust in bad models and mistrust in good models. Though posterior predictive checks have often been used for in-sample evaluation of Bayesian models, we show that they still have an important place in the modern deep learning world.
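For concreteness, the sketch below gives minimal NumPy implementations of the two metrics named in the abstract (PICP and a binned calibration error). It is an illustrative sketch rather than the paper's code; the function names, the 10-bin scheme, and the binary-classification setting for the calibration error are assumptions of the example.

```python
# Illustrative sketch (not the paper's code): plain-NumPy versions of the two
# metrics named above. Function names and the 10-bin scheme are assumptions
# made for this example only.
import numpy as np

def picp(y_true, lower, upper):
    """Prediction Interval Coverage Probability: fraction of observations
    falling inside their predicted intervals [lower, upper]."""
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

def binned_calibration_error(probs, labels, n_bins=10):
    """A common binned calibration error for a binary classifier: the
    weighted average gap between the mean predicted probability and the
    empirical frequency of the positive class within each confidence bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    error = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs > lo) & (probs <= hi)
        if in_bin.any():
            gap = abs(probs[in_bin].mean() - labels[in_bin].mean())
            error += in_bin.mean() * gap  # weight by fraction of samples in bin
    return float(error)
```

Note that both quantities summarize how predicted uncertainty matches observed frequencies; neither, on its own, separates epistemic from aleatoric uncertainty, which is the distinction at issue in the paper.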