Deep learning models are known to be vulnerable to adversarial examples, yet their adversarial susceptibility in image caption generation remains under-explored. We study adversarial examples for vision-and-language models, which typically adopt an encoder-decoder framework consisting of two major components: a Convolutional Neural Network (CNN) for image feature extraction and a Recurrent Neural Network (RNN) for caption generation. In particular, we investigate attacks on the visual encoder's hidden layer, whose output is fed to the subsequent recurrent network. Existing methods either attack the classification layer of the visual encoder or back-propagate gradients from the language model. In contrast, we propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN, so that the resulting deep features of the input image steer the recurrent network toward a controlled, incorrect caption. Our contribution provides new insights for understanding adversarial attacks on vision systems with a language component. The proposed method is evaluated with two attack strategies. The first examines whether a neural image captioning system can be misled into producing targeted image captions. The second analyzes whether targeted keywords can be injected into the predicted captions. Experiments show that our algorithm can craft effective adversarial images based on the CNN hidden layers to fool the captioning framework. Moreover, we find the proposed attack to be highly transferable. Our work leads to new robustness implications for neural image captioning.
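To make the core idea concrete, the sketch below illustrates a feature-space attack on the visual encoder: perturb the input image so that the CNN's hidden-layer features approximate those of a target, which in turn drives the downstream decoder toward a different caption. This is a minimal, hypothetical illustration using plain gradient descent rather than the paper's GAN-based algorithm; the encoder choice (a truncated ResNet-50), the perturbation budget, and all helper names are assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the paper's GAN-based method): a gradient-based
# feature-matching attack. We perturb an input image so that the CNN encoder's
# hidden-layer features move toward those of a target image, under an
# L-infinity budget. Inputs are assumed to be in [0, 1]; ImageNet normalization
# is omitted for brevity.
import torch
import torch.nn.functional as F
from torchvision import models

# Use a pretrained CNN up to a hidden layer as the visual encoder
# (assumption: the captioning model consumes pooled ResNet features).
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
encoder = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the fc layer

def feature_attack(x, x_target, eps=8 / 255, steps=200, lr=0.01):
    """Return a perturbed x whose encoder features approximate those of x_target."""
    with torch.no_grad():
        f_target = encoder(x_target).flatten(1)
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        f_adv = encoder((x + delta).clamp(0, 1)).flatten(1)
        loss = F.mse_loss(f_adv, f_target)   # mimic the target's deep features
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)          # keep the perturbation small
    return (x + delta).detach().clamp(0, 1)
```

In a captioning pipeline, the returned image would then be passed through the same encoder and the recurrent decoder; if the hidden-layer features are matched closely enough, the decoder tends to emit a caption consistent with the target features rather than the original image.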