We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language. In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse-trace segment. However, this protocol is challenging to apply to videos. Our new protocol empowers annotators to tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects. We annotated 20k videos from the OVIS, UVO, and Oops datasets, totalling 1.7M words. Based on this data, we also construct new benchmarks for the video narrative grounding and video question-answering tasks, and provide reference results from strong baseline models. Our annotations are available at https://google.github.io/video-localized-narratives/.
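For illustration only, the sketch below shows one plausible way to model the core idea of a Localized Narrative in code: each spoken word is paired with the mouse-trace segment recorded while it was uttered. All class and field names here are hypothetical and do not reflect the dataset's released schema.

```python
# Hypothetical sketch (not the released schema): each narrated word is
# grounded by the mouse-trace segment drawn while that word was spoken.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TraceSegment:
    """Mouse trace recorded while a single word was spoken."""
    start_time: float                      # seconds from narration start
    end_time: float
    points: List[Tuple[float, float]]      # (x, y) in normalized frame coordinates


@dataclass
class GroundedWord:
    word: str
    trace: TraceSegment


@dataclass
class VideoNarrative:
    video_id: str
    actor: str                             # the actor this narrative describes
    words: List[GroundedWord]

    def caption(self) -> str:
        """Plain-text narrative, dropping the spatial grounding."""
        return " ".join(w.word for w in self.words)


# Usage example with made-up values.
narrative = VideoNarrative(
    video_id="example_video",
    actor="person",
    words=[
        GroundedWord("a", TraceSegment(0.0, 0.2, [(0.41, 0.52)])),
        GroundedWord("dog", TraceSegment(0.2, 0.6, [(0.42, 0.53), (0.45, 0.55)])),
    ],
)
print(narrative.caption())  # -> "a dog"
```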