Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model's similarity outputs. However, popular saliency maps are limited to capturing only first-order attributions, overlooking the complex cross-modal interactions intrinsic to such encoders. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders. FIxLIP is rooted in game theory, where we analyze how the weighted Banzhaf interaction index offers greater flexibility and computational efficiency than the Shapley interaction quantification framework. From a practical perspective, we propose natural extensions of explanation evaluation metrics, such as the pointing game and the area between insertion/deletion curves, to second-order interaction explanations. Experiments on the MS COCO and ImageNet-1k benchmarks validate that second-order methods, such as FIxLIP, outperform first-order attribution methods. Beyond delivering high-quality explanations, we demonstrate the utility of FIxLIP in comparing different models, e.g., CLIP vs. SigLIP-2.
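For reference, the following is a standard form of the pairwise weighted Banzhaf interaction index mentioned above, given here as a sketch rather than the paper's exact formulation; the notation ($N$, $v$, $p$, $I_p$) is assumed and not taken from the abstract:
$$
I_p(i,j) \;=\; \sum_{T \subseteq N \setminus \{i,j\}} p^{|T|}\,(1-p)^{|N|-2-|T|}\,\big[\, v(T \cup \{i,j\}) - v(T \cup \{i\}) - v(T \cup \{j\}) + v(T) \,\big],
$$
where $N$ is the set of players (here, image patches and text tokens), $v(S)$ denotes the encoder's similarity score when only the subset $S$ is retained, and $p \in (0,1)$ weights coalition sizes. Setting $p = \tfrac{1}{2}$ recovers the standard Banzhaf interaction index, whereas the Shapley interaction index averages the same discrete derivative using Shapley weights over coalition sizes.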