Shortcut learning is common in deep learning models and harmful: it degrades feature representations and consequently jeopardizes a model's generalizability and interpretability. However, shortcut learning in the widely used Vision Transformer (ViT) framework remains largely unexplored. Meanwhile, introducing domain-specific knowledge is a major approach to rectifying shortcuts, which are predominantly driven by background-related factors. For example, in medical imaging, radiologists' eye-gaze data is an effective form of human visual prior knowledge with great potential to guide deep learning models toward meaningful foreground regions of interest. However, obtaining eye-gaze data is time-consuming, labor-intensive, and sometimes impractical. In this work, we propose a novel and effective saliency-guided vision transformer (SGT) model that rectifies shortcut learning in ViT without requiring eye-gaze data. Specifically, a computational visual saliency model is adopted to predict saliency maps for input images. The saliency maps are then used to distill the most informative image patches. In the proposed SGT, self-attention is computed only among the distilled informative patches. Since this distillation may discard global information, we further introduce, in the last encoder layer, a residual connection that captures self-attention across all image patches. Experimental results on four independent public datasets show that our SGT framework effectively learns and leverages human prior knowledge without eye-gaze data and achieves much better performance than baselines. Moreover, it successfully rectifies harmful shortcut learning and significantly improves the interpretability of the ViT model, demonstrating the promise of visual saliency derived from human prior knowledge for rectifying shortcut learning.
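The patch-distillation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `distill_patches` and the mean-pooling-then-top-k selection rule are assumptions about how a saliency map might be reduced to a set of retained patch indices before self-attention.

```python
import numpy as np

def distill_patches(saliency_map, patch_size, keep_ratio=0.5):
    """Return indices of the most salient image patches.

    saliency_map: (H, W) array from any computational visual saliency model.
    patch_size:   side length of square ViT patches.
    keep_ratio:   fraction of patches retained for self-attention.
    """
    H, W = saliency_map.shape
    gh, gw = H // patch_size, W // patch_size
    # Mean saliency per patch: crop to a whole number of patches,
    # reshape into a (gh, p, gw, p) grid, and average each p x p cell.
    patches = saliency_map[:gh * patch_size, :gw * patch_size]
    patches = patches.reshape(gh, patch_size, gw, patch_size).mean(axis=(1, 3))
    scores = patches.ravel()                 # one score per patch
    k = max(1, int(len(scores) * keep_ratio))
    return np.argsort(scores)[::-1][:k]      # top-k patch indices

# Example: a 64x64 saliency map with a salient central blob, 16x16 patches,
# giving a 4x4 patch grid of which half are kept.
sal = np.zeros((64, 64))
sal[16:48, 16:48] = 1.0                      # foreground region of interest
kept = distill_patches(sal, patch_size=16, keep_ratio=0.5)
print(sorted(int(i) for i in kept))
```

Self-attention in the SGT encoder would then be restricted to the `kept` indices, with the last layer's residual connection still attending over all patches to recover global context.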