In explainable AI, Concept Activation Vectors (CAVs) are typically obtained by training linear classifier probes to detect human-understandable concepts as directions in the activation space of deep neural networks. It is widely assumed that high probe accuracy indicates a CAV that faithfully represents its target concept. However, we show that the probe's classification accuracy alone is an unreliable measure of concept alignment, i.e., the degree to which a CAV captures the intended concept. In fact, we argue that probes are more likely to capture spurious correlations than to represent only the intended concept. As part of our analysis, we demonstrate that deliberately misaligned probes, constructed to exploit spurious correlations, achieve accuracy close to that of standard probes. To address this severe problem, we introduce a novel concept localization method based on spatial linear attribution and provide a comprehensive comparison with existing feature visualization techniques for detecting and mitigating concept misalignment. We further propose three classes of metrics for quantitatively assessing concept alignment: hard accuracy, segmentation scores, and augmentation robustness. Our analysis shows that translation-invariant and spatially aligned probes consistently improve concept alignment. These findings highlight the need for alignment-based evaluation metrics rather than probe accuracy alone, and the importance of tailoring probes to both the model architecture and the nature of the target concept.
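For readers unfamiliar with the probing setup the abstract refers to, the following is a minimal sketch of the standard CAV recipe (in the spirit of Kim et al.'s TCAV): a linear classifier is trained to separate activations of concept examples from activations of random examples, and its weight vector, normalized, is taken as the concept direction. The function name, synthetic data, and use of scikit-learn here are illustrative assumptions, not the paper's implementation; the reported probe accuracy is exactly the quantity the paper argues is insufficient as an alignment measure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(acts_concept, acts_random):
    """Train a linear probe separating concept activations from random
    activations; the normalized weight vector serves as the CAV direction.
    (Illustrative sketch; not the paper's exact procedure.)"""
    X = np.concatenate([acts_concept, acts_random], axis=0)   # (N, D) flattened layer activations
    y = np.concatenate([np.ones(len(acts_concept)),           # 1 = concept examples
                        np.zeros(len(acts_random))])          # 0 = random/negative examples
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    probe_accuracy = probe.score(X, y)    # accuracy on the probing set (the contested measure)
    cav = probe.coef_.ravel()
    return cav / np.linalg.norm(cav), probe_accuracy

# Usage with synthetic activations (D = 512 hidden units), purely for illustration:
rng = np.random.default_rng(0)
cav, acc = compute_cav(rng.normal(size=(100, 512)), rng.normal(size=(100, 512)))
```

A high value of `acc` is commonly read as evidence that `cav` encodes the target concept; the paper's point is that this reading can fail when the probe instead latches onto features merely correlated with the concept in the probing data.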