With the ever-increasing complexity of neural language models, practitioners have turned to methods for understanding the predictions of these models. One of the most well-adopted approaches for model interpretability is feature-based interpretability, i.e., ranking the features in terms of their impact on model predictions. Several prior studies have focused on assessing the fidelity of feature-based interpretability methods, i.e., measuring the impact of dropping the top-ranked features on the model output. However, relatively little work has been conducted on quantifying the robustness of interpretations. In this work, we assess the robustness of interpretations of neural text classifiers, specifically, those based on pretrained Transformer encoders, using two randomization tests. The first compares the interpretations of two models that are identical except for their initializations. The second measures whether the interpretations differ between a model with trained parameters and a model with random parameters. Both tests show surprising deviations from expected behavior, raising questions about the extent of insights that practitioners may draw from interpretations.
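To make the two randomization tests concrete, the sketch below shows how agreement between the interpretations of two models might be quantified. It assumes per-example attribution scores have already been computed for each model (e.g., with a gradient-based saliency method over a Transformer encoder); the `fake_attributions` helper is a hypothetical stand-in for that step, not the paper's code, and mean Spearman rank correlation is used here as one plausible agreement measure.

```python
# A minimal sketch of the two randomization tests, assuming attributions per
# example have already been computed for each model; the attribution source
# here is a synthetic stand-in, not the paper's actual pipeline.
import numpy as np
from scipy.stats import spearmanr


def fake_attributions(num_examples: int, num_features: int, seed: int) -> np.ndarray:
    """Stand-in for per-token attribution scores produced by one model."""
    return np.random.default_rng(seed).normal(size=(num_examples, num_features))


def mean_rank_correlation(attr_a: np.ndarray, attr_b: np.ndarray) -> float:
    """Average Spearman correlation between the feature rankings of two models."""
    corrs = []
    for a, b in zip(attr_a, attr_b):
        rho, _ = spearmanr(a, b)  # compare rankings, not raw magnitudes
        corrs.append(rho)
    return float(np.mean(corrs))


# Test 1: two trained models that are identical except for their initialization.
attr_init_a = fake_attributions(num_examples=100, num_features=32, seed=1)
attr_init_b = fake_attributions(num_examples=100, num_features=32, seed=2)
print("init-vs-init agreement:", mean_rank_correlation(attr_init_a, attr_init_b))

# Test 2: a model with trained parameters versus one with random parameters.
attr_trained = fake_attributions(num_examples=100, num_features=32, seed=1)
attr_random = fake_attributions(num_examples=100, num_features=32, seed=3)
print("trained-vs-random agreement:", mean_rank_correlation(attr_trained, attr_random))
```

Under the expected behavior described in the abstract, the first comparison should yield high agreement and the second low agreement; the paper reports surprising deviations from both expectations.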