Modern classification algorithms are susceptible to adversarial examples--perturbations of inputs that cause the algorithm to produce undesirable behavior. In this work, we seek to understand and extend adversarial examples to domains in which inputs are discrete, particularly new domains such as computational biology. As a step towards this goal, we formalize a notion of synonymous adversarial examples that applies in any discrete setting and describe a simple domain-agnostic algorithm for constructing such examples. We apply this algorithm across multiple domains--including sentiment analysis and DNA sequence classification--and find that it consistently uncovers adversarial examples. We study their prevalence theoretically and attribute their existence to spurious token correlations, a statistical phenomenon specific to discrete input spaces. Our work is a step towards a domain-agnostic treatment of discrete adversarial examples analogous to that of continuous inputs.
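The idea of a synonymous adversarial example can be illustrated with a minimal sketch: a classifier that has latched onto specific tokens (a spurious token correlation) can be flipped by meaning-preserving synonym swaps. The toy classifier, vocabulary, synonym table, and the `greedy_attack` helper below are illustrative assumptions for exposition, not the paper's actual models, data, or algorithm.

```python
# Toy word lists the classifier keys on; synonyms outside these lists
# carry no weight -- a spurious token correlation.
POSITIVE = {"good", "excellent"}
NEGATIVE = {"bad", "terrible"}

def toy_classify(tokens):
    """Toy sentiment classifier: +1 per listed positive token,
    -1 per listed negative token; predicts positive iff score > 0."""
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    return score > 0

# Hypothetical synonym table; each swap preserves meaning for a human reader.
SYNONYMS = {"good": ["fine", "decent"], "bad": ["poor", "lousy"]}

def greedy_attack(tokens, classify, synonyms):
    """Try single synonym swaps at each position and return the first
    perturbed sequence that flips the classifier's prediction, else None."""
    orig = classify(tokens)
    tokens = list(tokens)
    for i, tok in enumerate(tokens):
        for cand in synonyms.get(tok, []):
            trial = tokens[:i] + [cand] + tokens[i + 1:]
            if classify(trial) != orig:
                return trial  # synonymous adversarial example found
    return None

example = ["the", "movie", "was", "good"]
adv = greedy_attack(example, toy_classify, SYNONYMS)
# Swapping "good" -> "fine" flips the prediction, because the toy
# classifier correlates with the token "good" rather than the meaning.
```

Because the search only queries the classifier on candidate sequences, the same sketch applies unchanged to any discrete domain (e.g. DNA sequences with a substitution table in place of synonyms).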