Despite significant advances, deep networks remain highly susceptible to adversarial attack. One fundamental challenge is that small input perturbations can often produce large movements in the network's final-layer feature space. In this paper, we define an attack model that abstracts this challenge, to help understand its intrinsic properties. In our model, the adversary may move data an arbitrary distance in feature space but only in random low-dimensional subspaces. We prove such adversaries can be quite powerful: defeating any algorithm that must classify any input it is given. However, by allowing the algorithm to abstain on unusual inputs, we show such adversaries can be overcome when classes are reasonably well-separated in feature space. We further provide strong theoretical guarantees for setting algorithm parameters to optimize over accuracy-abstention trade-offs using data-driven methods. Our results provide new robustness guarantees for nearest-neighbor style algorithms, and also have application to contrastive learning, where we empirically demonstrate the ability of such algorithms to obtain high robust accuracy with low abstention rates. Our model is also motivated by strategic classification, where entities being classified aim to manipulate their observable features to produce a preferred classification, and we provide new insights into that area as well.
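To make the abstention idea concrete, below is a minimal sketch, not the paper's exact algorithm: a nearest-neighbor style rule in feature space that abstains whenever a test point is farther than a threshold from every labeled example. The function name `predict_or_abstain`, the threshold `tau`, and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def predict_or_abstain(x, features, labels, tau):
    """Return the label of the nearest training point, or None (abstain)
    if the nearest point is farther than tau in feature space.
    Illustrative sketch only; tau would be tuned, e.g. via data-driven
    methods, to trade off accuracy against abstention rate."""
    dists = np.linalg.norm(features - x, axis=1)  # distances to all training points
    i = np.argmin(dists)
    if dists[i] > tau:
        return None          # abstain: x is "unusual", far from all known data
    return labels[i]

# Toy usage: two well-separated clusters; a far-away point triggers abstention.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(3, 0.1, (20, 8))])
labs = np.array([0] * 20 + [1] * 20)
print(predict_or_abstain(feats[0] + 0.05, feats, labs, tau=1.0))   # -> 0
print(predict_or_abstain(np.full(8, 10.0), feats, labs, tau=1.0))  # -> None (abstain)
```

The point of the sketch is only to show why abstention helps when classes are well separated: an adversary that moves a point far from all training data in feature space lands it in the abstain region rather than forcing a wrong label.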