Modeling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that six existing video-language models struggle to understand even such simple temporal relations. We then question whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks that require varying degrees of time awareness. We observe encouraging performance gains, especially when the task demands a higher degree of time awareness. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data- and compute-intensive training from scratch.
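To make the before/after consistency idea concrete, here is a minimal sketch of what a time-order contrastive objective for such post-pretraining might look like. It assumes each two-event clip is paired with a correctly ordered caption (e.g., "A before B") as the positive and an order-flipped caption as the negative; the function name, tensor shapes, and temperature are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: a time-order contrastive loss over VideoCLIP-style embeddings.
# All names and hyperparameters here are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def time_order_contrastive_loss(video_emb, text_emb_correct, text_emb_flipped,
                                temperature=0.07):
    """Contrast each clip against its correctly ordered vs. order-flipped caption.

    video_emb:         (B, D) embeddings of two-event video clips
    text_emb_correct:  (B, D) embeddings of captions with the true event order
    text_emb_flipped:  (B, D) embeddings of captions with the order reversed
    """
    v = F.normalize(video_emb, dim=-1)
    t_pos = F.normalize(text_emb_correct, dim=-1)
    t_neg = F.normalize(text_emb_flipped, dim=-1)

    # Cosine similarity of each clip with its two candidate captions.
    sim_pos = (v * t_pos).sum(dim=-1) / temperature  # (B,)
    sim_neg = (v * t_neg).sum(dim=-1) / temperature  # (B,)

    # InfoNCE over the two candidates: the correct order should score higher.
    logits = torch.stack([sim_pos, sim_neg], dim=1)       # (B, 2)
    labels = torch.zeros(len(logits), dtype=torch.long)   # index 0 = correct order
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings standing in for model outputs.
B, D = 4, 512
loss = time_order_contrastive_loss(torch.randn(B, D), torch.randn(B, D),
                                   torch.randn(B, D))
print(loss.item())
```

A loss of this shape only updates the model when it confuses the two orderings, which is one plausible way to add time awareness on top of a frozen-architecture model without full re-training.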