Bayesian deep learning seeks to equip deep neural networks with the ability to precisely quantify their predictive uncertainty, and has promised to make deep learning more reliable for safety-critical real-world applications. Yet existing Bayesian deep learning methods fall short of this promise: new methods continue to be evaluated on unrealistic test beds that do not reflect the complexities of the downstream real-world tasks that would benefit most from reliable uncertainty quantification. We propose the RETINA Benchmark, a set of real-world tasks that accurately reflect such complexities and are designed to assess the reliability of predictive models in safety-critical scenarios. Specifically, we curate two publicly available datasets of high-resolution human retina images exhibiting varying degrees of diabetic retinopathy, a medical condition that can lead to blindness, and use them to design a suite of automated diagnosis tasks that require reliable predictive uncertainty quantification. We use these tasks to benchmark well-established and state-of-the-art Bayesian deep learning methods on task-specific evaluation metrics. We provide an easy-to-use codebase for fast benchmarking that follows reproducibility and software design principles. We also provide implementations of all methods included in the benchmark, as well as results computed over 100 TPU days, 20 GPU days, 400 hyperparameter configurations, and evaluation on at least 6 random seeds each.