Despite the strong performance of current NLP models, they can be brittle against adversarial attacks. To enable effective learning against adversarial inputs, we introduce the use of rationale models that can explicitly learn to ignore attack tokens. We find that the rationale models can successfully ignore over 90% of attack tokens. This approach leads to consistent, sizable improvements ($\sim$10%) over baseline models in robustness on three datasets, for both BERT and RoBERTa, and also reliably outperforms data augmentation with adversarial examples alone. In many cases, our method closes the gap between model performance on a clean test set and an attacked test set, thereby mitigating the effect of adversarial attacks.
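To make the idea concrete, the following is a minimal sketch of a select-then-predict rationale model of the kind described above: a selector scores each token and produces a hard rationale mask, and a predictor classifies from only the selected tokens, so attack tokens can be dropped before prediction. This is an illustrative toy (a plain embedding encoder instead of the BERT/RoBERTa encoders used in the paper); the class and variable names are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RationaleModel(nn.Module):
    """Toy select-then-predict model: selector masks tokens, predictor classifies."""

    def __init__(self, vocab_size=30522, dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Selector assigns each token a keep-probability; low-scoring tokens
        # (e.g., inserted attack tokens) are masked out before prediction.
        self.selector = nn.Linear(dim, 1)
        self.predictor = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, token_ids):
        h = self.embed(token_ids)                 # (batch, seq, dim)
        probs = torch.sigmoid(self.selector(h))   # (batch, seq, 1) keep-probabilities
        # Hard 0/1 rationale mask with a straight-through estimator so the
        # selector remains trainable by gradient descent.
        hard = (probs > 0.5).float()
        mask = hard + probs - probs.detach()
        # Mean-pool only the selected tokens, then classify.
        pooled = (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        return self.predictor(pooled), mask.squeeze(-1)

# Usage: classify a toy batch and inspect which tokens the selector kept.
model = RationaleModel()
logits, rationale = model(torch.randint(0, 30522, (2, 16)))
print(logits.shape, rationale.shape)  # torch.Size([2, 2]) torch.Size([2, 16])
```

In practice the selector and predictor would share or wrap a pretrained encoder, and the rationale mask would be supervised or regularized so that attack tokens fall outside the selected rationale; the sketch only shows the structural separation between selection and prediction.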