Adversarial vulnerability remains a major obstacle to constructing reliable NLP systems. When imperceptible perturbations are added to raw input text, the performance of a deep learning model can drop dramatically under attack. Recent work argues that this vulnerability is caused by the non-robust features learned during supervised training. In this paper, we therefore tackle the adversarial robustness challenge from the perspective of disentangled representation learning, which can explicitly separate robust and non-robust features in text. Specifically, inspired by the variation of information (VI) from information theory, we derive a disentangled learning objective composed of mutual-information terms that capture both the semantic representativeness of the latent embeddings and the differentiation between robust and non-robust features. On this basis, we design a disentangled learning network to estimate these mutual-information terms. Experiments on text classification and entailment tasks show that our method significantly outperforms representative methods under adversarial attacks, indicating that discarding non-robust features is critical for improving adversarial robustness.
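For reference, the variation of information between two random variables $X$ and $Y$ is the standard information-theoretic quantity

$$\mathrm{VI}(X;Y) = H(X \mid Y) + H(Y \mid X) = H(X) + H(Y) - 2\,I(X;Y),$$

where $H(\cdot)$ denotes entropy and $I(\cdot;\cdot)$ mutual information. A minimal sketch of an objective in this spirit, using hypothetical symbols $z_r$ and $z_{nr}$ for the robust and non-robust latent features and $y$ for the task label (illustrative only, not necessarily the paper's exact formulation), is

$$\max_{\theta} \; I(z_r; y) - \lambda\, I(z_r; z_{nr}),$$

which rewards the semantic representativeness of $z_r$ with respect to the label while penalizing overlap between the robust and non-robust features; the trade-off weight $\lambda$ is likewise an assumed hyperparameter.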