Test of Time: Instilling Video-Language Models with a Sense of Time

Modeling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that six existing video-language models struggle to understand even such simple temporal relations. We then ask whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. To this end, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks that require varying degrees of time awareness. We observe encouraging performance gains, especially when the task needs higher time awareness. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data- and compute-intensive training from scratch.
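
To make the notion of time-order consistency concrete, the following is a minimal sketch of how a before/after probe could be phrased. It is an illustration only and not the paper's implementation: the helper names (stitch, time_reversed, is_time_consistent), the toy scorer, and the representation of a video as an ordered pair of event descriptions are all assumptions made for this example. A real probe would replace toy_score with the video-text similarity of an actual model.

```python
from typing import Callable, Tuple

# Hypothetical stand-in for a video: the two events it shows, in true order.
Video = Tuple[str, str]


def stitch(first: str, second: str, relation: str) -> str:
    """Caption for a clip in which `first` happens, then `second`."""
    if relation == "before":
        return f"{first} before {second}"
    if relation == "after":
        return f"{second} after {first}"
    raise ValueError("relation must be 'before' or 'after'")


def time_reversed(first: str, second: str, relation: str) -> str:
    """Caption that states the opposite order (events swapped)."""
    return stitch(second, first, relation)


def is_time_consistent(
    score: Callable[[Video, str], float],
    video: Video,
    relation: str = "before",
) -> bool:
    """True if the scorer prefers the correctly ordered caption."""
    first, second = video
    correct = stitch(first, second, relation)
    flipped = time_reversed(first, second, relation)
    return score(video, correct) > score(video, flipped)


def toy_score(video: Video, caption: str) -> float:
    """Toy order-aware scorer (assumption): 1.0 if the caption matches
    the true order under either relation, else 0.0."""
    first, second = video
    valid = {stitch(first, second, "before"), stitch(first, second, "after")}
    return 1.0 if caption in valid else 0.0


def order_blind_score(video: Video, caption: str) -> float:
    """Toy scorer that ignores word order entirely (assumption)."""
    return 0.5


clip = ("a baby cries", "a dog barks")
aware = is_time_consistent(toy_score, clip, "before")
blind = is_time_consistent(order_blind_score, clip, "before")
```

In this sketch, a scorer that cannot tell the two captions apart fails the check, which is the kind of behavior the abstract attributes to existing models on before/after relations.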

