Existing studies have demonstrated that adversarial examples can be directly attributed to the presence of non-robust features, which are highly predictive but can be easily manipulated by adversaries to fool NLP models. In this study, we explore the feasibility of capturing task-specific robust features, while eliminating the non-robust ones, by using the information bottleneck theory. Through extensive experiments, we show that models trained with our information bottleneck-based method achieve a significant improvement in robust accuracy, exceeding the performance of all previously reported defense methods, while suffering almost no drop in clean accuracy on the SST-2, AGNEWS, and IMDB datasets.
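The abstract does not spell out how the information bottleneck objective is instantiated. A common realization is the variational information bottleneck (VIB), which compresses an encoder representation into a stochastic code and penalizes its KL divergence from a standard normal prior; the sketch below is one such instantiation, not the paper's exact method, and the names `VIBLayer`, `bottleneck_dim`, and `beta` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBLayer(nn.Module):
    """Variational information bottleneck sketch: compress an encoder
    representation h into a stochastic code z, penalizing I(X; Z)
    via a KL term against a standard normal prior (assumed setup)."""

    def __init__(self, in_dim: int, bottleneck_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, bottleneck_dim)
        self.log_var = nn.Linear(in_dim, bottleneck_dim)

    def forward(self, h: torch.Tensor):
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        # KL(q(z|x) || N(0, I)), summed over dims, averaged over the batch
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1).mean()
        return z, kl

def ib_loss(logits, labels, kl, beta=1e-3):
    """Task loss plus beta-weighted compression term. Intuitively, beta
    controls how aggressively easily-perturbed (non-robust) information
    is squeezed out of the representation while keeping what is needed
    for the task."""
    return F.cross_entropy(logits, labels) + beta * kl
```

In this reading, the classifier consumes z instead of the raw encoder output, so features that are not predictive enough to survive the compression penalty, which plausibly includes the non-robust ones the abstract describes, are filtered out during training.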