We propose a new framework for understanding and representing related salient events in a video using visual semantic role labeling. We represent a video as a set of related events, wherein each event consists of a verb and multiple entities that fulfill various roles relevant to that event. To study this challenging task of semantic role labeling in videos, or VidSRL, we introduce the VidSitu benchmark, a large-scale video understanding data source with $29K$ $10$-second movie clips richly annotated with verbs and semantic roles every $2$ seconds. Entities are co-referenced across events within a movie clip, and events are connected to each other via event-event relations. Clips in VidSitu are drawn from a large collection of movies (${\sim}3K$) and have been chosen to be both complex (${\sim}4.2$ unique verbs within a video) and diverse (${\sim}200$ verbs have more than $100$ annotations each). We provide a comprehensive analysis of the dataset in comparison to other publicly available video understanding benchmarks, present several illustrative baselines, and evaluate a range of standard video recognition models. Our code and dataset are available at vidsitu.org.
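To make the representation concrete, the sketch below illustrates how a VidSitu-style clip annotation could be organized: a $10$-second clip holds a set of co-referenced entities, a sequence of $2$-second events (each a verb plus role-to-entity assignments), and event-event relations. All class and field names here are hypothetical and chosen for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical sketch of a VidSRL annotation record; field names are illustrative only.

@dataclass
class Event:
    start_sec: int          # event start within the 10-second clip
    end_sec: int            # event end (events are annotated every 2 seconds)
    verb: str               # e.g. "push"
    roles: Dict[str, str]   # semantic role -> entity id, e.g. {"Arg0": "E1", "Arg1": "E2"}

@dataclass
class ClipAnnotation:
    clip_id: str
    entities: Dict[str, str]                       # entity id -> description, co-referenced across events
    events: List[Event] = field(default_factory=list)
    relations: List[Tuple[int, str, int]] = field(default_factory=list)  # (event_a, relation, event_b)

# Example: two of the five 2-second events in a 10-second clip (hypothetical content).
clip = ClipAnnotation(
    clip_id="movie_0001_clip_17",
    entities={"E1": "man in a red jacket", "E2": "wooden door"},
    events=[
        Event(0, 2, "run", {"Arg0": "E1", "Scene": "hallway"}),
        Event(2, 4, "push", {"Arg0": "E1", "Arg1": "E2"}),
    ],
    relations=[(0, "enables", 1)],
)
```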