Natural language video localization (NLVL) is an important task in the vision-language understanding area, which calls for an in-depth understanding of not only the computer vision and natural language sides alone, but more importantly the interplay between both sides. Adversarial vulnerability has been well recognized as a critical security issue of deep neural network models, which requires prudent investigation. Despite extensive yet separate studies of it in video and language tasks, the current understanding of adversarial robustness in vision-language joint tasks like NLVL is less developed. This paper therefore aims to comprehensively investigate the adversarial robustness of NLVL models by examining three facets of vulnerabilities from both the attack and defense aspects. To achieve the attack goal, we propose a new adversarial attack paradigm called synonymous sentences-aware adversarial attack on NLVL (SNEAK), which captures the cross-modality interplay between the vision and language sides.