Saliency methods provide post-hoc model interpretation by attributing input features to model outputs. Current methods mainly achieve this using a single input sample, thereby failing to answer input-independent inquiries about the model. We also show that input-specific saliency mapping is intrinsically susceptible to misleading feature attribution. Current attempts to use 'general' input features for model interpretation assume access to a dataset containing those features, which biases the interpretation. Addressing this gap, we introduce a new perspective of input-agnostic saliency mapping that computationally estimates the high-level features attributed by the model to its outputs. These features are geometrically correlated and are computed by accumulating the model's gradient information with respect to an unrestricted data distribution. To compute them, we nudge independent data points over the model's loss surface towards the local minima associated with a human-understandable concept, e.g., a class label for classifiers. Through a systematic projection, scaling, and refinement process, this information is transformed into an interpretable visualization without compromising its model fidelity. The visualization serves as a stand-alone qualitative interpretation. With an extensive evaluation, we not only demonstrate successful visualizations for a variety of concepts for large-scale models, but also showcase an interesting utility of this new form of saliency mapping by identifying backdoor signatures in compromised classifiers.
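To make the core mechanism concrete, below is a minimal sketch of the gradient-accumulation idea described above, assuming a PyTorch image classifier. The function name, the Gaussian sampling of data points, the step sizes, and the crude normalisation are illustrative assumptions; the paper's actual projection, scaling, and refinement stages are not reproduced here.

```python
# Hypothetical sketch: nudge unconstrained inputs toward the loss minimum of a
# target class and accumulate gradient information along the way. Not the
# paper's exact algorithm; names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def accumulate_class_gradients(model, target_class, num_points=64,
                               steps=100, lr=0.05, input_shape=(3, 224, 224)):
    model.eval()
    # Independent data points drawn from an unrestricted (here: Gaussian) distribution.
    x = torch.randn(num_points, *input_shape, requires_grad=True)
    accumulated = torch.zeros(input_shape)
    target = torch.full((num_points,), target_class, dtype=torch.long)

    for _ in range(steps):
        logits = model(x)
        loss = F.cross_entropy(logits, target)
        grad, = torch.autograd.grad(loss, x)
        # Accumulate input-gradient information across data points and steps.
        accumulated += grad.detach().abs().mean(dim=0)
        # Gradient-descent "nudge" of the points over the loss surface
        # towards the minimum associated with the target class.
        with torch.no_grad():
            x -= lr * grad
    return accumulated

# Example usage with a pretrained classifier (assumed available via torchvision):
# from torchvision.models import resnet50, ResNet50_Weights
# model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
# saliency = accumulate_class_gradients(model, target_class=207)
# saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min())  # crude scaling
```

The sketch only covers the accumulation step; turning the accumulated signal into the interpretable visualization discussed in the abstract would additionally require the projection, scaling, and refinement process referred to there.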