Work to date on language-informed video understanding has primarily addressed two tasks: (1) video question answering using multiple-choice questions, on which models perform relatively well because they can exploit the readily available candidate answers; and (2) video captioning, which relies on an open-ended evaluation framework that is often inaccurate because system answers may be judged incorrect merely for differing in form from the ground truth. In this paper, we propose fill-in-the-blanks as a video understanding evaluation framework that addresses these drawbacks and more closely reflects real-life settings, where no answer choices are given. The task tests a system's understanding of a video by requiring the model to predict a masked noun phrase in the video's caption, given the video and the surrounding text. We introduce a novel dataset consisting of 28,000 videos and fill-in-the-blank tests. We show that both a multimodal model and a strong language model fall well short of human performance, suggesting that the task is more challenging than current video understanding benchmarks.
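To make the task format concrete, the construction of a fill-in-the-blank instance can be sketched as follows. This is a minimal illustrative sketch, not code from the paper: the caption, the chosen noun phrase, and the `make_blank` helper are all hypothetical examples of how a caption and an annotated noun phrase could be turned into a (masked caption, answer) pair.

```python
def make_blank(caption: str, noun_phrase: str, mask: str = "_____") -> tuple[str, str]:
    """Replace the first occurrence of `noun_phrase` in `caption` with `mask`,
    returning the masked caption (the test input) and the ground-truth answer.

    Hypothetical helper for illustration; the paper's actual pipeline for
    selecting and masking noun phrases is not specified here."""
    if noun_phrase not in caption:
        raise ValueError("noun phrase not found in caption")
    return caption.replace(noun_phrase, mask, 1), noun_phrase

masked, answer = make_blank("A man pours water into a glass.", "a glass")
# masked == "A man pours water into _____."
# answer == "a glass"
```

Given the video and the masked caption, the model must generate the missing noun phrase, so evaluation compares a short constrained span rather than an entire free-form caption.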