We present FireBERT, a set of three proof-of-concept NLP classifiers hardened against TextFooler-style word-perturbation attacks by producing diverse alternatives to the original samples. In the first approach, we co-tune BERT against the training data and synthetic adversarial samples. In the second approach, we generate the synthetic samples at evaluation time through word substitution and perturbation of embedding vectors, then combine the diversified evaluation results by voting. The third approach replaces evaluation-time word substitution entirely with perturbation of embedding vectors. We evaluate FireBERT on the MNLI and IMDB Movie Review datasets, both on original samples and on adversarial examples generated by TextFooler. We also test whether TextFooler is less successful at creating new adversarial samples when attacking FireBERT than when attacking unhardened classifiers. We show that it is possible to improve the accuracy of BERT-based models in the face of adversarial attacks without significantly reducing accuracy on regular benchmark samples. We present co-tuning with a synthetic data generator as a highly effective method that protects against 95% of pre-manufactured adversarial samples while maintaining 98% of original benchmark performance. We also demonstrate evaluation-time perturbation as a promising direction for further research, restoring accuracy up to 75% of benchmark performance on pre-made adversarials, and up to 65% (from a baseline of 75% orig. / 12% attack) under active attack by TextFooler.
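The evaluation-time defense described above, perturbing embedding vectors to diversify predictions and combining the results by voting, can be sketched as follows. This is a minimal illustration under assumed hyperparameters (Gaussian noise scale, number of copies) and a placeholder classifier; it is not the paper's actual FireBERT configuration:

```python
import numpy as np

def vote_with_perturbation(classify, embedding, n_copies=8, sigma=0.05, seed=0):
    """Classify an input several times under small Gaussian perturbations
    of its embedding vector and return the majority-vote label.

    `classify` is any function mapping an embedding array to an integer
    class label (in FireBERT it would be the BERT-based classifier head).
    `n_copies` and `sigma` are illustrative values, not the paper's.
    """
    rng = np.random.default_rng(seed)
    votes = [classify(embedding)]  # always include the unperturbed sample
    for _ in range(n_copies - 1):
        noisy = embedding + rng.normal(0.0, sigma, size=embedding.shape)
        votes.append(classify(noisy))
    # majority vote over the diversified predictions
    labels, counts = np.unique(votes, return_counts=True)
    return int(labels[np.argmax(counts)])

# Toy usage with a stand-in classifier that thresholds the mean activation:
toy_classify = lambda e: int(e.mean() > 0.0)
label = vote_with_perturbation(toy_classify, np.full(16, 1.0))
```

The intuition is that a TextFooler-style adversarial sample sits close to a decision boundary, so small random shifts in embedding space flip some of the perturbed copies back to the correct class, and the vote recovers it.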