Neural networks have achieved state-of-the-art performance in various machine learning fields, yet incorporating malicious perturbations into input data (adversarial examples) has been shown to fool their predictions. This poses potential risks for real-world applications, such as endangering autonomous driving and disrupting text identification. To mitigate such risks, an understanding of how adversarial examples operate is critical, which, however, remains unresolved. Here we demonstrate that adversarial perturbations contain human-recognizable information, which is the key culprit responsible for a neural network's erroneous prediction, in contrast to the widely discussed argument that human-imperceptible information plays the critical role in fooling a network. This concept of human-recognizable information allows us to explain key features of adversarial perturbations, including the existence of adversarial examples, their transferability among different neural networks, and the increased interpretability of adversarially trained networks. Two unique properties of adversarial perturbations that fool neural networks are uncovered: masking and generation. A special class, the complementary class, is identified when neural networks classify input images. The human-recognizable information contained in adversarial perturbations allows researchers to gain insight into the working principles of neural networks and may lead to techniques that detect and defend against adversarial attacks.
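As background for the adversarial examples discussed above, the sketch below illustrates one common way such perturbations are generated, the fast gradient sign method (FGSM); this is an assumption for illustration only, not necessarily the attack studied in this work, and `model`, `x`, `y`, and `epsilon` are hypothetical placeholders.

```python
# Minimal FGSM-style sketch (illustrative only). Assumes a pretrained image
# classifier `model` and a correctly labeled input batch (x, y) in [0, 1].
import torch
import torch.nn.functional as F

def fgsm_perturbation(model, x, y, epsilon=8 / 255):
    """Return an adversarial perturbation delta with ||delta||_inf <= epsilon."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # loss w.r.t. the true labels
    loss.backward()                       # gradient of the loss w.r.t. the input
    delta = epsilon * x.grad.sign()       # small step that increases the loss
    return delta.detach()

# Usage sketch: the perturbed image x + delta often changes the model's
# prediction even though the change is small in pixel space.
# delta = fgsm_perturbation(model, x, y)
# x_adv = torch.clamp(x + delta, 0.0, 1.0)
# pred_clean, pred_adv = model(x).argmax(1), model(x_adv).argmax(1)
```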