As neural networks become more popular, the need for accompanying uncertainty estimates increases. The current testing methodology focuses on how well the predictive uncertainty estimates explain the differences between predictions and observations in a previously unseen test set. Intuitively, this is a logical approach, and the current setup of benchmark data sets also allows easy comparison between the different methods. We demonstrate, however, through both theoretical arguments and simulations, that this way of evaluating the quality of uncertainty estimates has serious flaws. Firstly, it cannot disentangle aleatoric from epistemic uncertainty. Secondly, the current methodology considers the uncertainty averaged over all test samples, implicitly averaging out overconfident and underconfident predictions: when checking whether the correct fraction of test points falls inside the prediction intervals, a good average score gives no guarantee that the intervals are sensible for individual points. We demonstrate through practical examples that these effects can lead to favouring, on the basis of the predictive uncertainty, a method whose confidence intervals behave undesirably. Finally, we propose a simulation-based testing approach that addresses these problems while still allowing easy comparison between different methods.
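To make the averaging problem concrete, the sketch below is a small toy illustration of our own (not an experiment from the paper): a heteroscedastic data set is paired with a hypothetical predictor that knows the correct mean but uses a single interval half-width for every input, chosen so that exactly 95% of all test points fall inside their interval. The marginal coverage looks perfect, yet the intervals are far too wide for most inputs and far too narrow for the rest.

```python
# Minimal sketch (toy illustration, not the paper's experiment): marginal 95%
# coverage can hide badly mis-calibrated individual prediction intervals.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

x = rng.uniform(0.0, 1.0, size=n)
true_sigma = np.where(x < 0.8, 0.2, 3.0)   # low noise for 80% of inputs, high noise for 20%
y = rng.normal(0.0, true_sigma)            # observations around a known mean of 0

# Hypothetical predictor: correct mean, but one interval half-width for every
# input, calibrated so that exactly 95% of *all* points are covered.
half_width = np.quantile(np.abs(y), 0.95)
inside = np.abs(y) <= half_width

print(f"overall coverage:        {inside.mean():.3f}")            # ~0.95, looks perfect
print(f"coverage where x < 0.8:  {inside[x < 0.8].mean():.3f}")   # ~1.00, underconfident (intervals too wide)
print(f"coverage where x >= 0.8: {inside[x >= 0.8].mean():.3f}")  # ~0.75, overconfident (intervals too narrow)
```

In this toy setting the aggregate coverage check would rate the constant-width intervals as well calibrated, even though no individual interval has the nominal 95% coverage.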