Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art results for zero-shot performance on three downstream tasks that focus on verb understanding: video-text matching, video question-answering and video classification. To the best of our knowledge, this is the first work to propose a method that alleviates the verb understanding problem, rather than simply highlighting it.
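As a rough illustration of component (1) only: the sketch below shows a cross-modal InfoNCE-style contrastive loss in which LLM-generated verb hard negatives are appended as extra text candidates for each video. The function name, tensor shapes, and the use of PyTorch are assumptions made for this sketch; the paper's exact formulation, its calibration strategy, and the verb phrase alignment loss are not reproduced here.

# Minimal sketch (assumed PyTorch implementation, not the paper's code):
# each video is contrasted against the in-batch captions plus K LLM-generated
# verb hard negatives for its own caption.
import torch
import torch.nn.functional as F

def verb_hard_negative_contrastive_loss(video_emb, text_emb, hard_neg_emb, temperature=0.07):
    # video_emb:    (B, D) L2-normalised video embeddings
    # text_emb:     (B, D) L2-normalised embeddings of the matching captions
    # hard_neg_emb: (B, K, D) embeddings of K verb hard negatives per caption
    sim_batch = video_emb @ text_emb.t() / temperature  # (B, B), positives on the diagonal
    sim_hard = torch.einsum("bd,bkd->bk", video_emb, hard_neg_emb) / temperature  # (B, K)
    logits = torch.cat([sim_batch, sim_hard], dim=1)     # (B, B + K) candidates per video
    targets = torch.arange(video_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Shape check with random embeddings:
B, K, D = 4, 3, 512
loss = verb_hard_negative_contrastive_loss(
    F.normalize(torch.randn(B, D), dim=-1),
    F.normalize(torch.randn(B, D), dim=-1),
    F.normalize(torch.randn(B, K, D), dim=-1),
)
print(loss.item())

Appending the hard negatives only to their own video's row keeps the in-batch negatives unchanged while making the model distinguish captions that differ mainly in the verb.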