Several machine learning models, including neural networks, consistently misclassify adversarial examples---inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input causes the model to output an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results and gives the first account of the most intriguing fact about adversarial examples: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.
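As an illustration of the kind of simple, fast, gradient-based perturbation the abstract alludes to, the sketch below applies the sign of the input gradient of the loss, x + ε · sign(∇ₓ J(θ, x, y)), to a toy logistic-regression model. The model, data, and ε value are hypothetical stand-ins for illustration, not the paper's maxout/MNIST setup.

```python
import numpy as np

# Minimal sketch of a sign-of-the-gradient perturbation on a toy
# logistic-regression model. Model, data, and epsilon are illustrative
# assumptions, not the experimental setup described in the abstract.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_sign_perturb(x, y, w, b, epsilon):
    """Return x + epsilon * sign(grad_x J(w, x, y)) for the logistic loss J."""
    p = sigmoid(x @ w + b)      # predicted probability of class 1
    grad_x = (p - y) * w        # gradient of cross-entropy loss w.r.t. the input x
    return x + epsilon * np.sign(grad_x)

# Toy example: a linear model in a 100-dimensional input space.
rng = np.random.default_rng(0)
w = rng.normal(size=100)
b = 0.0
x = rng.normal(size=100)
y = 1.0

x_adv = gradient_sign_perturb(x, y, w, b, epsilon=0.1)
print("clean prob:", sigmoid(x @ w + b))
print("adv prob  :", sigmoid(x_adv @ w + b))
```

Adversarial training, as described in the abstract, would then mix such perturbed inputs (with their original labels) into the training objective so the model learns to resist them.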