When designing a diagnostic model for a clinical application, it is crucial to ensure that the model is robust to a wide range of image corruptions. Herein, an easy-to-use benchmark is established to evaluate how deep neural networks perform on corrupted pathology images. Specifically, corrupted images are generated by injecting nine types of common corruptions into validation images. In addition, two classification metrics and one ranking metric are designed to evaluate prediction and confidence performance under corruption. Evaluating on the two resulting benchmark datasets, we find that (1) a variety of deep neural network models suffer a significant accuracy decrease (double the error on clean images) and produce unreliable confidence estimates on corrupted images; and (2) validation and test errors are only weakly correlated, whereas replacing the validation set with our benchmark increases the correlation. Our code is available at https://github.com/superjamessyx/robustness_benchmark.
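The general idea of corrupting validation images and comparing errors can be illustrated with a minimal sketch. It is not the benchmark's implementation: the abstract does not list the nine corruption types or define the metrics, so Gaussian noise is used here only as a stand-in corruption, and the function names (`gaussian_noise`, `error_rate`, `relative_error`), the five-level severity scale, and the noise magnitudes are all assumptions made for illustration.

```python
import numpy as np

# Assumed noise standard deviations for severity levels 1..5 (illustrative only,
# not taken from the paper).
SIGMAS = [0.04, 0.08, 0.12, 0.18, 0.26]


def gaussian_noise(image, severity=1, rng=None):
    """Return a copy of a uint8 image with zero-mean Gaussian noise added."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = image.astype(np.float32) / 255.0           # scale to [0, 1]
    noisy = x + rng.normal(0.0, SIGMAS[severity - 1], size=x.shape)
    noisy = np.clip(noisy, 0.0, 1.0)               # keep values in valid range
    return (noisy * 255.0).round().astype(np.uint8)


def error_rate(preds, labels):
    """Fraction of predictions that differ from the labels."""
    return float(np.mean(np.asarray(preds) != np.asarray(labels)))


def relative_error(clean_err, corrupted_err):
    """Ratio of corrupted to clean error; 2.0 means the error doubled."""
    return corrupted_err / clean_err
```

In use, one would corrupt each validation image at every severity level, run the classifier on both the clean and the corrupted copies, and compare the two error rates; a ratio near 2 would correspond to the doubling of error reported above.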