A common approach to quantifying model interpretability is to calculate faithfulness metrics based on iteratively masking input tokens and measuring how much the predicted label changes as a result. However, we show that such metrics are generally not suitable for comparing the interpretability of different neural text classifiers as the response to masked inputs is highly model-specific. We demonstrate that iterative masking can produce large variation in faithfulness scores between comparable models, and show that masked samples are frequently outside the distribution seen during training. We further investigate the impact of adversarial attacks and adversarial training on faithfulness scores, and demonstrate the relevance of faithfulness measures for analyzing feature salience in text adversarial attacks. Our findings provide new insights into the limitations of current faithfulness metrics and key considerations to utilize them appropriately.
翻译:暂无翻译