Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals. In this paper, we propose a message passing graph neural network that explicitly models these spatio-temporal relations and can use explicit representations of objects, when supervision is available, and implicit representations otherwise. Our formulation generalises previous structured models for video understanding, and allows us to study how different design choices in graph structure and representation affect the model's performance. We demonstrate our method on two different tasks requiring relational reasoning in videos -- spatio-temporal action detection on AVA and UCF101-24, and video scene graph classification on the recent Action Genome dataset -- and achieve state-of-the-art results on all three datasets. Furthermore, we show quantitatively and qualitatively how our method is able to more effectively model relationships between relevant entities in the scene.
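To make the core mechanism concrete, the sketch below illustrates one round of message passing over a spatio-temporal graph of actor/object nodes. It is a minimal, generic illustration only, not the paper's architecture; the names (`message_passing_round`, `node_feats`, `adjacency`, `W_msg`, `W_update`) are hypothetical.

```python
# Illustrative sketch: one round of message passing over a spatio-temporal graph.
# Nodes are actor/object representations across frames; edges encode their
# spatio-temporal relations. All names here are hypothetical, not from the paper.
import numpy as np

def message_passing_round(node_feats, adjacency, W_msg, W_update):
    """Each node aggregates transformed features of its neighbours
    (entities in the same or nearby frames) and updates its own state."""
    messages = node_feats @ W_msg                      # (N, D) transformed sender features
    deg = adjacency.sum(axis=1, keepdims=True).clip(min=1)
    aggregated = (adjacency @ messages) / deg          # (N, D) mean over neighbours
    # Update: combine each node's current state with its aggregated messages.
    updated = np.tanh(np.concatenate([node_feats, aggregated], axis=1) @ W_update)
    return updated

# Toy usage: 5 nodes (actor/object proposals across frames) with 8-dim features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))
adj = (rng.random((5, 5)) > 0.5).astype(float)   # binary spatio-temporal edges
W_msg = rng.normal(size=(8, 8))
W_update = rng.normal(size=(16, 8))
print(message_passing_round(feats, adj, W_msg, W_update).shape)  # (5, 8)
```

In the paper's setting, the node representations would be explicit object features when supervision is available and implicit ones otherwise, with the graph structure being one of the design choices studied.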