Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame. These approaches often ignore the fact that videos are essentially sequences of 2D "views" of events happening in a 3D space, and that the semantics of the 3D scene can thus be carried over from frame to frame. Leveraging this insight, we propose a (2.5+1)D scene-graph representation to better capture the spatio-temporal information flow inside videos. Specifically, we first create a 2.5D (pseudo-3D) scene graph by transforming every 2D frame to have an inferred 3D structure using an off-the-shelf 2D-to-3D transformation module, following which we register the video frames into a shared (2.5+1)D spatio-temporal space and ground each 2D scene graph within it. Such a (2.5+1)D graph is then segregated into a static sub-graph and a dynamic sub-graph, corresponding to whether the objects within them usually move in the world. The nodes in the dynamic sub-graph are enriched with motion features capturing their interactions with other graph nodes. Next, for the video QA task, we present a novel transformer-based reasoning pipeline that embeds the (2.5+1)D graph into a spatio-temporal hierarchical latent space, where the sub-graphs and their interactions are captured at varying levels of granularity. To demonstrate the effectiveness of our approach, we present experiments on the NExT-QA and AVSD-QA datasets. Our results show that our proposed (2.5+1)D representation leads to faster training and inference, while our hierarchical model showcases superior performance on the video QA task compared to the state of the art.
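As an illustration of the 2D-to-2.5D lifting step described above, the minimal Python sketch below back-projects each 2D detection centre through a monocular depth map using the pinhole camera model, and tags each node as static or dynamic. Everything here is an assumption for exposition: the function name `lift_to_2p5d`, the hand-picked `STATIC_LABELS` split, and the camera intrinsics are hypothetical and do not reflect the paper's actual implementation or its 2D-to-3D transformation module.

```python
import numpy as np

# Assumed label split for the static/dynamic sub-graphs (illustrative only;
# the paper's criterion is whether objects usually move in the world).
STATIC_LABELS = {"table", "sofa", "wall"}

def lift_to_2p5d(detections, depth, fx, fy, cx, cy):
    """Back-project 2D detection centres to pseudo-3D (2.5D) nodes.

    detections: list of (label, (u, v)) box centres in pixel coordinates.
    depth: H x W depth map from any off-the-shelf monocular depth estimator.
    fx, fy, cx, cy: pinhole camera intrinsics (assumed known).
    """
    nodes = []
    for label, (u, v) in detections:
        z = float(depth[int(v), int(u)])  # inferred depth at the box centre
        # Pinhole back-projection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        nodes.append({
            "label": label,
            "xyz": np.array([x, y, z]),
            "dynamic": label not in STATIC_LABELS,  # dynamic vs. static sub-graph
        })
    return nodes

# Toy usage: one frame with two detections and a constant 2 m depth map.
depth = np.full((240, 320), 2.0)
frame_nodes = lift_to_2p5d(
    [("person", (160.0, 120.0)), ("table", (100.0, 200.0))],
    depth, fx=300.0, fy=300.0, cx=160.0, cy=120.0,
)
print([(n["label"], n["xyz"].round(2), n["dynamic"]) for n in frame_nodes])
```

Registering such per-frame node sets into the shared (2.5+1)D space would then amount to expressing all `xyz` coordinates in a common world frame across time, after which the static and dynamic sub-graphs can be processed at different granularities as the abstract describes.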