A growing line of work has investigated the development of neural NLP models that can produce rationales--subsets of the input that can explain their predictions. In this paper, we ask whether such rationale models can provide robustness to adversarial attacks in addition to their interpretable nature. Since these models need to first generate rationales ("rationalizer") before making predictions ("predictor"), they have the potential to ignore noise or adversarially added text by simply masking it out of the generated rationale. To this end, we systematically generate various types of 'AddText' attacks for both token-level and sentence-level rationalization tasks, and perform an extensive empirical evaluation of state-of-the-art rationale models across five different tasks. Our experiments reveal that rationale models show promise in improving robustness but struggle in certain scenarios--for instance, when the rationalizer is sensitive to position bias or to the lexical choices of the attack text. Further, leveraging human rationales as supervision does not always translate to better performance. Our study is a first step towards exploring the interplay between interpretability and robustness in the rationalize-then-predict framework.
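To make the rationalize-then-predict setup and the AddText attack concrete, the following is a minimal, purely illustrative sketch. The lexicon-based rationalizer, the toy predictor, and the distractor sentence are hypothetical stand-ins introduced here for exposition; they are not the learned models or attack generation procedures evaluated in the paper.

```python
# Toy rationalize-then-predict pipeline under a sentence-level 'AddText' attack.
# All components below are illustrative assumptions, not the paper's models.

RATIONALE_LEXICON = {"great", "superb", "awful", "boring"}  # assumed toy lexicon

def rationalizer(tokens):
    # Toy rationalizer: keep only tokens it considers label-relevant.
    return [t for t in tokens if t in RATIONALE_LEXICON]

def predictor(rationale):
    # Toy predictor: sees only the selected rationale, not the full input.
    positive = {"great", "superb"}
    score = sum(1 if t in positive else -1 for t in rationale)
    return "positive" if score > 0 else "negative"

original = "the movie was great and the acting superb".split()
# AddText attack: a distractor sentence appended to the original input.
attack = "my commute this morning felt endless and very slow".split()

clean_pred = predictor(rationalizer(original))
attacked_pred = predictor(rationalizer(original + attack))
# The prediction is unaffected only if the rationalizer masks the attack text
# out of the rationale, which is exactly what happens in this toy case.
assert clean_pred == attacked_pred
```

In this sketch the attack tokens never reach the predictor because the rationalizer excludes them, which is the potential source of robustness the abstract describes; the paper's experiments probe when learned rationalizers actually achieve this and when they fail.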