Current approaches for designing self-explainable models (SEMs) require complicated training procedures and specific architectures, which makes them impractical. With the advance of general-purpose foundation models based on Vision Transformers (ViTs), this impracticality becomes even more problematic. New methods are therefore needed to provide transparency and reliability for ViT-based foundation models. In this work, we present a new method, Keypoint Counting Classifiers (KCCs), for turning any well-trained ViT-based model into a SEM without retraining. Recent works have shown that ViTs can automatically identify matching keypoints between images with high precision, and we build on these results to create an easily interpretable decision process that is inherently visualizable in the input space. An extensive evaluation shows that KCCs improve human-machine communication compared to recent baselines. We believe that KCCs constitute an important step towards making ViT-based foundation models more transparent and reliable.
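For intuition, the following is a minimal sketch of what a keypoint-counting decision could look like. It is not the paper's exact procedure: it assumes pre-extracted, L2-normalised ViT patch features, a hypothetical mutual-nearest-neighbour matching rule with a cosine-similarity threshold, and per-class reference images, all of which are illustrative assumptions.

```python
import numpy as np


def count_keypoint_matches(query_feats, ref_feats, sim_threshold=0.6):
    """Count mutual nearest-neighbour patch matches between two images.

    query_feats, ref_feats: (N, D) arrays of L2-normalised ViT patch features.
    A patch pair counts as a matched keypoint if each patch is the other's
    nearest neighbour and their cosine similarity exceeds sim_threshold
    (threshold value is an illustrative assumption, not from the paper).
    """
    sim = query_feats @ ref_feats.T   # (Nq, Nr) cosine similarities
    nn_q = sim.argmax(axis=1)         # best reference patch for each query patch
    nn_r = sim.argmax(axis=0)         # best query patch for each reference patch
    matches = 0
    for i, j in enumerate(nn_q):
        if nn_r[j] == i and sim[i, j] >= sim_threshold:
            matches += 1
    return matches


def keypoint_counting_classify(query_feats, class_refs, sim_threshold=0.6):
    """Predict the class whose reference images share the most matched keypoints.

    class_refs: dict mapping class label -> list of (N, D) reference feature arrays.
    Returns (predicted_label, per_class_counts); each count corresponds to a set of
    patch correspondences that can be visualized directly on the input images.
    """
    counts = {
        label: sum(count_keypoint_matches(query_feats, r, sim_threshold) for r in refs)
        for label, refs in class_refs.items()
    }
    return max(counts, key=counts.get), counts
```

Because the prediction is just an integer count of visualizable correspondences per class, the explanation is the decision itself rather than a post-hoc attribution map.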