Video event extraction aims to detect salient events from a video and identify the arguments of each event as well as their semantic roles. Existing methods focus on capturing the overall visual scene of each frame, ignoring fine-grained argument-level information. Inspired by the definition of events as changes of states, we propose a novel framework to detect video events by tracking the changes in the visual states of all involved arguments, which are expected to provide the most informative evidence for video event extraction. To capture the visual state changes of arguments, we decompose them into changes in pixels within objects, displacements of objects, and interactions among multiple arguments. We further propose Object State Embedding, Object Motion-aware Embedding, and Argument Interaction Embedding to encode and track these changes, respectively. Experiments on various video event extraction tasks demonstrate significant improvements compared to state-of-the-art models. In particular, on verb classification, we achieve 3.49% absolute gains (19.53% relative gains) in F1@5 on Video Situation Recognition.
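To make the decomposition concrete, below is a minimal, hypothetical sketch of how the three argument-level embeddings could be computed, assuming each argument (object) has already been detected and tracked across T frames, yielding per-frame appearance features and normalized bounding boxes. The class and parameter names (`ArgumentStateTracker`, `dim`, etc.) are illustrative assumptions, not the paper's actual implementation: a GRU over appearance features stands in for Object State Embedding, a GRU over frame-to-frame box displacements for Object Motion-aware Embedding, and attention across arguments for Argument Interaction Embedding.

```python
import torch
import torch.nn as nn


class ArgumentStateTracker(nn.Module):
    """Illustrative sketch of the three embeddings named in the abstract.

    Assumes N arguments tracked over T frames, with per-frame appearance
    features and normalized boxes. Names and shapes are assumptions only.
    """

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Object State Embedding: summarize pixel-level appearance changes
        # of each object over time with a temporal encoder.
        self.state_rnn = nn.GRU(dim, dim, batch_first=True)
        # Object Motion-aware Embedding: encode frame-to-frame displacement
        # of the object's box (x, y, w, h), then summarize over time.
        self.motion_proj = nn.Linear(4, dim)
        self.motion_rnn = nn.GRU(dim, dim, batch_first=True)
        # Argument Interaction Embedding: let arguments attend to each other
        # to capture interactions among multiple arguments.
        self.interaction = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        """feats: (N, T, dim) appearance features for N arguments over T frames.
        boxes: (N, T, 4) normalized box coordinates. Returns (N, dim)."""
        # (1) appearance-change (state) embedding per argument
        _, state = self.state_rnn(feats)                  # (1, N, dim)
        state = state.squeeze(0)
        # (2) motion embedding from frame-to-frame box displacements
        disp = boxes[:, 1:] - boxes[:, :-1]               # (N, T-1, 4)
        _, motion = self.motion_rnn(self.motion_proj(disp))
        motion = motion.squeeze(0)
        # (3) interaction embedding: attention over the set of arguments,
        # using time-averaged features as one token per argument
        tokens = feats.mean(dim=1).unsqueeze(0)           # (1, N, dim)
        inter, _ = self.interaction(tokens, tokens, tokens)
        inter = inter.squeeze(0)
        # fuse the three views into a single argument embedding
        return self.fuse(torch.cat([state, motion, inter], dim=-1))


# Example: 5 tracked arguments over 8 frames
tracker = ArgumentStateTracker(dim=256)
emb = tracker(torch.randn(5, 8, 256), torch.rand(5, 8, 4))
print(emb.shape)  # torch.Size([5, 256])
```

The design choice mirrored here is the abstract's decomposition: appearance change, motion, and interaction are encoded separately and fused per argument, so the downstream event classifier can reason over argument-level evidence rather than whole-frame features.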