Recent work has shown that explainability and robustness are two crucial ingredients of trustworthy and reliable text classification. However, previous work usually addresses only one of these two aspects: i) how to extract accurate rationales for explainability while remaining beneficial to prediction; ii) how to make the predictive model robust to different types of adversarial attacks. Intuitively, a model that produces helpful explanations should also be more robust against adversarial attacks, because we cannot trust a model that outputs explanations yet changes its prediction under small perturbations. To this end, we propose a joint classification and rationale extraction model named AT-BMC. It includes two key mechanisms: mixed Adversarial Training (AT), which applies perturbations in both the discrete and embedding spaces to improve the model's robustness, and a Boundary Match Constraint (BMC), which locates rationales more precisely under the guidance of boundary information. Results on benchmark datasets demonstrate that the proposed AT-BMC outperforms baselines on both classification and rationale extraction by a large margin. Robustness analysis shows that AT-BMC effectively decreases the attack success rate by up to 69%. These empirical results indicate that there are connections between robust models and better explanations.
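The abstract only names the mixed AT mechanism, so as a rough illustration, below is a minimal PyTorch sketch of the embedding-space half of mixed AT, using an FGM-style perturbation (a standard choice for perturbing in embedding space, not necessarily the paper's exact attack). The HuggingFace-style model interface (`get_input_embeddings`, a forward returning `.logits`), the batch keys, and `epsilon` are all assumptions for the sake of the sketch.

```python
import torch
import torch.nn.functional as F

def fgm_adversarial_step(model, batch, epsilon=1.0):
    """One step of FGM-style embedding-space adversarial training.

    Assumes a HuggingFace-style classifier whose forward returns .logits
    and whose input embedding table is reachable via get_input_embeddings().
    """
    emb = model.get_input_embeddings()

    # 1) Clean forward/backward pass to obtain gradients on the embeddings.
    logits = model(input_ids=batch["input_ids"],
                   attention_mask=batch["attention_mask"]).logits
    loss = F.cross_entropy(logits, batch["labels"])
    loss.backward()

    # 2) Build an L2-normalized perturbation from the embedding gradient
    #    and apply it in place.
    grad = emb.weight.grad
    norm = torch.norm(grad)
    if norm != 0 and not torch.isnan(norm):
        delta = epsilon * grad / norm
        emb.weight.data.add_(delta)

        # 3) Adversarial pass: accumulate gradients of the perturbed loss.
        adv_logits = model(input_ids=batch["input_ids"],
                           attention_mask=batch["attention_mask"]).logits
        adv_loss = F.cross_entropy(adv_logits, batch["labels"])
        adv_loss.backward()

        # 4) Restore the original embeddings before the optimizer step.
        emb.weight.data.sub_(delta)

    return loss.item()
```

The optimizer step and `zero_grad` are assumed to happen in the surrounding training loop. The discrete-space perturbations mentioned in the abstract (e.g., word-level substitutions in the input text) would be applied before tokenization and are omitted from this sketch.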