Due to the growing adoption of deep neural networks in many fields of science and engineering, modeling and estimating their uncertainties has become of primary importance. Despite the growing literature on uncertainty quantification in deep learning, the quality of the uncertainty estimates remains an open question. In this work, we assess for the first time the performance of several approximation methods for Bayesian neural networks on regression tasks by evaluating the quality of their confidence regions with several coverage metrics. The selected algorithms are also compared in terms of predictivity, kernelized Stein discrepancy and maximum mean discrepancy with respect to a reference posterior, in both weight and function space. Our findings show that (i) some algorithms have excellent predictive performance but tend to largely over- or underestimate uncertainties, (ii) it is possible to achieve good accuracy and a given target coverage with finely tuned hyperparameters, and (iii) the promising kernelized Stein discrepancy cannot be exclusively relied on to assess the posterior approximation. As a by-product of this benchmark, we also compute and visualize the similarity of all algorithms and corresponding hyperparameters: interestingly, we identify a few clusters of algorithms with similar behavior in weight space, giving new insights into how they explore the posterior distribution.
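To make the evaluation criteria mentioned above more concrete, the sketch below illustrates, under simplifying assumptions of our own (Gaussian toy data, an RBF kernel with a fixed bandwidth, the biased MMD estimator), how the empirical coverage of predictive intervals and the maximum mean discrepancy between two sets of posterior samples could be computed. It is a minimal NumPy example, not the benchmark's actual implementation.

```python
import numpy as np

def empirical_coverage(y_true, lower, upper):
    # Fraction of test targets that fall inside their predicted intervals.
    return np.mean((y_true >= lower) & (y_true <= upper))

def mmd_rbf(X, Y, bandwidth=1.0):
    # Biased estimate of the squared maximum mean discrepancy with an RBF kernel
    # (bandwidth fixed here for illustration; in practice it would be tuned).
    def k(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * bandwidth**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# Toy usage (hypothetical arrays): 95% intervals from posterior predictive draws.
rng = np.random.default_rng(0)
y = rng.normal(size=200)                                # test targets
samples = y + rng.normal(scale=0.5, size=(1000, 200))   # posterior predictive samples
lo, hi = np.quantile(samples, [0.025, 0.975], axis=0)
print("coverage:", empirical_coverage(y, lo, hi))       # ideally close to 0.95

# MMD between two hypothetical sets of weight samples (e.g. a method vs. a reference).
W1 = rng.normal(size=(500, 10))
W2 = rng.normal(loc=0.2, size=(500, 10))
print("MMD^2:", mmd_rbf(W1, W2))
```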