Recently, there has been growing interest in studying adversarial examples against natural language models in the black-box setting. These methods attack natural language classifiers by perturbing certain important words until the classifier's label changes. To find these important words, existing methods rank all words of an input sentence by importance, querying the target model once per word, which makes them highly query-inefficient. An interesting new approach addresses this problem by using interpretable learning to learn the word ranking instead of performing the previous expensive search. The main advantage of this approach is that it achieves attack rates comparable to state-of-the-art methods while being faster and issuing fewer queries; fewer queries are desirable because they reduce suspicion towards the attacking agent. Nonetheless, this approach sacrifices, for the sake of query efficiency, useful information that could be leveraged from the target classifier. In this paper, we study the effect of leveraging the target model's outputs and data on both attack rates and the average number of queries, and we show that both can be improved with only a limited overhead of additional queries.
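For concreteness, the word-by-word querying that makes these attacks query-inefficient can be sketched as follows. This is a minimal illustration, not the exact procedure of any specific method; `target_model` is a hypothetical black-box scorer returning the probability of the original label, and all names are assumptions for illustration only.

```python
from typing import Callable, List, Tuple

def rank_words_by_importance(
    sentence: List[str],
    target_model: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Rank words by the drop in the original-label probability
    when each word is removed: one model query per word."""
    base_score = target_model(" ".join(sentence))
    importances = []
    for i, word in enumerate(sentence):
        # Remove the i-th word and query the black-box model again.
        perturbed = sentence[:i] + sentence[i + 1:]
        score = target_model(" ".join(perturbed))
        importances.append((word, base_score - score))
    # Words whose removal causes the largest probability drop rank highest.
    return sorted(importances, key=lambda t: t[1], reverse=True)
```

Because this ranking alone costs one query per word for every input sentence, before any perturbation is even attempted, learning the ranking offline (as in the approach discussed above) removes this per-sentence search cost.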