Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. Observable changes, such as movements, manipulations, and transformations of objects in the scene, are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent. Thus, for video understanding tasks such as captioning videos or answering questions about them, one must have an understanding of these commonsense aspects. We present the first work on generating commonsense captions directly from videos, describing latent aspects such as intentions, effects, and attributes. We introduce a new dataset, "Video-to-Commonsense (V2C)", that contains $\sim9k$ videos of human agents performing various actions, annotated with three types of commonsense descriptions. Additionally, we explore open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions. Both the generation task and the QA task can be used to enrich video captions.
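To make the three annotation types concrete, the following is a minimal sketch of what a single V2C annotation record and an accompanying V2C-QA pair might look like; all field names and values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical V2C annotation record (field names are illustrative only,
# not the released dataset's actual schema).
example_record = {
    "video_id": "video0001",  # identifier of the source video clip
    "caption": "a man lifts a heavy box onto a shelf",          # conventional factual caption
    "intention": "to tidy up the storage room",                 # why the agent performs the action
    "effect": "the shelf becomes stacked with boxes",           # what changes due to the action
    "attribute": "strong and organized",                        # trait describing the agent
}

# A hypothetical open-ended V2C-QA pair grounded in the same clip,
# targeting one latent aspect (here, intention).
example_qa = {
    "video_id": "video0001",
    "question": "What does the man intend to accomplish?",
    "answer": "to tidy up the storage room",
}
```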