The prototypical part network (ProtoPNet) has drawn wide attention and inspired many follow-up studies because of its self-explanatory property for explainable artificial intelligence (XAI). However, when ProtoPNet is applied directly to vision transformer (ViT) backbones, the learned prototypes suffer from a "distraction" problem: they have a relatively high probability of being activated by the background and pay less attention to the foreground. The transformer's powerful capability for modeling long-range dependencies makes it hard for a transformer-based ProtoPNet to focus on prototypical parts, which severely impairs its inherent interpretability. This paper proposes the prototypical part transformer (ProtoPFormer) for appropriately and effectively applying prototype-based methods to ViTs for interpretable image recognition. In line with the architectural characteristics of ViTs, the proposed method introduces global and local prototypes that capture and highlight the representative holistic and partial features of targets. The global prototypes provide a global view of objects that guides the local prototypes to concentrate on the foreground while eliminating the influence of the background. The local prototypes are then explicitly supervised to concentrate on their respective prototypical visual parts, increasing the overall interpretability. Extensive experiments demonstrate that the proposed global and local prototypes can mutually correct each other and jointly make final decisions, faithfully and transparently explaining the decision-making process from the holistic and local perspectives, respectively. Moreover, ProtoPFormer consistently achieves superior performance and visualization results over state-of-the-art (SOTA) prototype-based baselines. Our code has been released at https://github.com/zju-vipa/ProtoPFormer.
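To make the interplay of global and local prototypes concrete, the following is a minimal sketch and not the released implementation. The abstract does not specify the similarity measure, how the foreground is selected, or how the two branches are combined, so the sketch assumes cosine similarity, a binary foreground mask supplied from outside, and a simple sum of linear class scores. All function and parameter names (`global_scores`, `local_scores`, `fg_mask`, and so on) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def global_scores(class_token, global_protos):
    """Compare the ViT class token with each global prototype (holistic view)."""
    return [cosine(class_token, p) for p in global_protos]

def local_scores(patch_tokens, fg_mask, local_protos):
    """For each local prototype, take the best match over foreground patches only.

    fg_mask is a hypothetical per-patch boolean list; background patches are
    dropped so that they cannot activate a local prototype.
    """
    foreground = [t for t, keep in zip(patch_tokens, fg_mask) if keep]
    if not foreground:
        return [0.0] * len(local_protos)
    return [max(cosine(t, p) for t in foreground) for p in local_protos]

def class_logits(scores, weights):
    """Linear layer: weights[c][k] links prototype k to class c."""
    return [sum(w * s for w, s in zip(row, scores)) for row in weights]

def joint_decision(g_scores, l_scores, g_weights, l_weights):
    """Sum the two branches' class scores (an assumed combination rule)."""
    g = class_logits(g_scores, g_weights)
    l = class_logits(l_scores, l_weights)
    return [a + b for a, b in zip(g, l)]
```

In this toy setup, a patch that matches a local prototype perfectly but lies in the background contributes nothing once it is masked out, which is the behavior the "distraction" problem calls for. How the foreground is actually derived from the global branch is left open here.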