Despite significant advances, deep networks remain highly susceptible to adversarial attack. One fundamental challenge is that small input perturbations can often produce large movements in the network's final-layer feature space. In this paper, we define an attack model that abstracts this challenge in order to study its intrinsic properties. In our model, the adversary may move data an arbitrary distance in feature space, but only within random low-dimensional subspaces. We prove that such adversaries can be quite powerful: they defeat any algorithm that must classify every input it is given. However, by allowing the algorithm to abstain on unusual inputs, we show that such adversaries can be overcome when classes are reasonably well-separated in feature space. We further provide strong theoretical guarantees for setting algorithm parameters to optimize accuracy-abstention trade-offs using data-driven methods. Our results provide new robustness guarantees for nearest-neighbor-style algorithms, and they also apply to contrastive learning, where we empirically demonstrate the ability of such algorithms to obtain high robust accuracy with low abstention rates. Our model is also motivated by strategic classification, in which the entities being classified aim to manipulate their observable features to produce a preferred classification, and we provide new insights into that area as well.
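The following is a minimal illustrative sketch, not taken from the paper, of the two ingredients the abstract describes: an adversary that moves a feature-space point an arbitrary distance but only within a random low-dimensional subspace, and a nearest-neighbor-style classifier that abstains on inputs far from the training data. The dimensions, perturbation budget, subspace rank, and abstention radius are hypothetical parameters chosen purely for illustration.

```python
# Hypothetical sketch of the attack model and an abstaining nearest-neighbor
# classifier; parameter values are illustrative assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(0)

def random_subspace_attack(x, budget=10.0, rank=2):
    """Move x a large distance (here: `budget`), but only within a random
    `rank`-dimensional subspace of feature space."""
    d = x.shape[0]
    basis, _ = np.linalg.qr(rng.standard_normal((d, rank)))  # orthonormal basis
    coeffs = rng.standard_normal(rank)
    coeffs *= budget / np.linalg.norm(coeffs)                # arbitrary distance = budget
    return x + basis @ coeffs

def nn_classify_with_abstention(x, train_X, train_y, abstain_radius=1.0):
    """Predict the label of the nearest training point, but abstain
    (return None) if the input is far from all training data."""
    dists = np.linalg.norm(train_X - x, axis=1)
    i = int(np.argmin(dists))
    if dists[i] > abstain_radius:
        return None  # abstain on unusual inputs
    return train_y[i]

# Usage: two well-separated classes in a 50-dimensional feature space.
# A perturbed point that lands far from both classes is rejected rather
# than misclassified.
train_X = np.vstack([rng.normal(0.0, 0.1, (20, 50)),
                     rng.normal(3.0, 0.1, (20, 50))])
train_y = np.array([0] * 20 + [1] * 20)
x_adv = random_subspace_attack(train_X[0])
print(nn_classify_with_abstention(x_adv, train_X, train_y))  # typically None (abstain)
```

The sketch only illustrates the mechanism: because the attack direction is confined to a random low-dimensional subspace, a large perturbation almost surely pushes the point away from both well-separated classes, so an algorithm permitted to abstain can reject it instead of being forced into a wrong label.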