A classifier either misclassifies or correctly classifies an input text. Misclassified texts include both ordinary texts that receive incorrect predictions and adversarial texts, which are generated to fool the classifier, called the victim. Both types are misunderstood by the victim but can still be recognized by other classifiers, which induces large gaps in predicted probabilities between the victim and the others. In contrast, text correctly classified by the victim is often also predicted correctly by the others and induces small gaps. In this paper, we propose an ensemble model based on similarity estimation of predicted probabilities (SEPP) that exploits the contrast between the large gaps for misclassified texts and the small gaps for correctly classified ones. SEPP then corrects the incorrect predictions of the misclassified texts. We demonstrate the resilience of SEPP in detecting and defending against adversarial texts across different types of victim classifiers, classification tasks, and adversarial attacks.
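
The gap intuition can be illustrated with a minimal sketch. This is not the SEPP model itself: the abstract does not specify how similarity is estimated, so the sketch assumes a simple total variation distance between the victim's probabilities and the average of the other classifiers' probabilities, plus a fixed threshold. The names `mean_probs`, `prob_gap`, `detect_and_correct`, and the `threshold` value are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def mean_probs(prob_rows: Sequence[Sequence[float]]) -> List[float]:
    """Average several probability vectors class by class."""
    n = len(prob_rows)
    return [sum(col) / n for col in zip(*prob_rows)]


def prob_gap(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance: 0 for identical vectors, 1 for disjoint ones.

    Assumption: this stands in for the similarity estimate, which the
    abstract does not define.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def detect_and_correct(
    victim: Sequence[float],
    others: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Tuple[bool, int]:
    """Flag a text as likely misclassified and return a (possibly corrected) label.

    A large gap between the victim and the other classifiers suggests a
    misclassified or adversarial text; the label then comes from the others.
    A small gap keeps the victim's own prediction.
    """
    reference = mean_probs(others)
    flagged = prob_gap(victim, reference) > threshold
    source = reference if flagged else victim
    label = max(range(len(source)), key=lambda i: source[i])
    return flagged, label


# Victim is confident in class 0, the other classifiers agree on class 1.
print(detect_and_correct([0.9, 0.1], [[0.2, 0.8], [0.1, 0.9]]))  # (True, 1)

# Victim and the other classifiers agree on class 0.
print(detect_and_correct([0.8, 0.2], [[0.9, 0.1], [0.7, 0.3]]))  # (False, 0)
```

In the first call the gap is 0.75, so the text is flagged and relabeled by the other classifiers; in the second the gap is close to zero and the victim's prediction is kept. A learned similarity estimator, as the name SEPP suggests, would replace the fixed distance and threshold assumed here.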