Saliency maps have become a widely used method to make deep learning models more interpretable by providing post-hoc explanations of classifiers through identification of the most pertinent areas of the input medical image. They are increasingly being used in medical imaging to provide clinically plausible explanations for the decisions the neural network makes. However, the utility and robustness of these visualization maps have not yet been rigorously examined in the context of medical imaging. We posit that trustworthiness in this context requires 1) localization utility, 2) sensitivity to model weight randomization, 3) repeatability, and 4) reproducibility. Using the localization information available in two large public radiology datasets, we quantify the performance of eight commonly used saliency map approaches against the above criteria using the area under the precision-recall curve (AUPRC) and the structural similarity index (SSIM), comparing their performance to various baseline measures. Using our framework to quantify the trustworthiness of saliency maps, we show that all eight saliency map techniques fail at least one of the criteria and are, in most cases, less trustworthy than the baselines. We suggest that their usage in the high-risk domain of medical imaging warrants additional scrutiny and recommend that detection or segmentation models be used if localization is the desired output of the network. Additionally, to promote reproducibility of our findings, we provide the code we used for all tests performed in this work at this link: https://github.com/QTIM-Lab/Assessing-Saliency-Maps.
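To make the two metrics named above concrete, the following is a minimal sketch of how AUPRC and SSIM could be computed for saliency maps. It is not the authors' implementation (their code is in the linked repository). The function names `average_precision` and `global_ssim` are hypothetical, the AUPRC is approximated by step-wise average precision, and the SSIM is a simplified single-window (global) variant rather than the usual sliding-window form.

```python
from statistics import fmean


def flatten(image):
    """Flatten a 2D list-of-lists image into a 1D list of pixel values."""
    return [value for row in image for value in row]


def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve (average precision).

    scores: per-pixel saliency values; labels: 1 inside the ground-truth
    region, 0 outside. Assumption: tied scores are not grouped, so ties
    are ranked in their input order.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive pixel")
    # Rank pixels from most to least salient.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos


def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM computed over the whole image as a single window."""
    c1 = (0.01 * data_range) ** 2  # standard stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = fmean([(a - mean_x) ** 2 for a in x])
    var_y = fmean([(b - mean_y) ** 2 for b in y])
    cov = fmean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Toy 2x2 example: saliency concentrated on the annotated top row.
saliency = flatten([[0.9, 0.8], [0.1, 0.2]])
mask = flatten([[1, 1], [0, 0]])
localization_score = average_precision(saliency, mask)
self_similarity = global_ssim(saliency, saliency)
```

Under these assumptions, localization utility would compare a map against an annotated region with `average_precision`, while the randomization, repeatability, and reproducibility criteria would compare pairs of maps with an SSIM-style score.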