We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.
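To make the training recipe concrete, below is a minimal PyTorch sketch of the core idea: adversarial perturbations are added to the image and text *embeddings* (not pixels or tokens), gradients with respect to model parameters are accumulated across the inner ascent steps in the "free" style, and a KL-divergence term pulls perturbed predictions toward the clean ones. Everything here is illustrative: `ToyVLModel`, `adversarial_step`, and the hyperparameters `adv_steps`, `adv_lr`, `alpha` are assumptions for the sketch, not the authors' released code (the actual VILLA models are large transformers such as UNITER).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a V+L model: fuses pooled image and text embeddings
# into classification logits. Real VILLA backbones are transformers.
class ToyVLModel(nn.Module):
    def __init__(self, dim=64, num_classes=4):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img_emb, txt_emb):
        h = torch.cat([img_emb.mean(1), txt_emb.mean(1)], dim=-1)
        return self.head(torch.relu(self.fuse(h)))

def adversarial_step(model, optimizer, img_emb, txt_emb, labels,
                     adv_steps=3, adv_lr=1e-2, alpha=1.0):
    """One "free"-style adversarial update in the embedding space.
    Parameter gradients accumulate across the inner ascent steps, so
    the adversarial examples come at little extra cost (a sketch only)."""
    optimizer.zero_grad()

    # Clean forward pass; its (detached) softmax anchors the KL term.
    clean_logits = model(img_emb, txt_emb)
    clean_probs = F.softmax(clean_logits, dim=-1).detach()
    (F.cross_entropy(clean_logits, labels) / (adv_steps + 1)).backward()

    # Perturbations live on each modality's embeddings, not raw inputs.
    delta_i = torch.zeros_like(img_emb, requires_grad=True)
    delta_t = torch.zeros_like(txt_emb, requires_grad=True)

    for _ in range(adv_steps):
        adv_logits = model(img_emb + delta_i, txt_emb + delta_t)
        # Task loss on perturbed embeddings + KL regularizer that
        # promotes invariance of predictions under perturbation.
        kl = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                      clean_probs, reduction="batchmean")
        loss = (F.cross_entropy(adv_logits, labels) + alpha * kl) \
               / (adv_steps + 1)
        loss.backward()  # accumulates into BOTH model and delta grads

        # Ascend on the perturbations with a normalized gradient step.
        with torch.no_grad():
            for d in (delta_i, delta_t):
                g = d.grad
                d += adv_lr * g / (g.norm() + 1e-12)
            delta_i.grad = None
            delta_t.grad = None

    # One parameter update using gradients averaged over all steps.
    optimizer.step()

# Usage with random data: batch of 8, 10 image regions, 12 text tokens.
model = ToyVLModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
img = torch.randn(8, 10, 64)
txt = torch.randn(8, 12, 64)
y = torch.randint(0, 4, (8,))
adversarial_step(model, opt, img, txt, y)
```

The same loop serves both stages described above: during task-agnostic adversarial pre-training the task loss would be a pre-training objective (e.g. masked language modeling), while during task-specific adversarial finetuning it is the downstream loss, as in this sketch.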