How do humans recognize the action "opening a book"? We argue that there are two important cues: modeling temporal shape dynamics and modeling functional relationships between humans and objects. In this paper, we propose to represent videos as space-time region graphs which capture these two important cues. Our graph nodes are defined by object region proposals from different frames of a long-range video. These nodes are connected by two types of relations: (i) similarity relations capturing the long-range dependencies between correlated objects and (ii) spatial-temporal relations capturing the interactions between nearby objects. We perform reasoning on this graph representation via Graph Convolutional Networks. We achieve state-of-the-art results on both the Charades and Something-Something datasets. Especially on Charades, where actions occur in complex environments, we obtain a substantial 4.4% gain.
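To make the graph-reasoning step concrete, below is a minimal sketch of one graph-convolution update over region-proposal features. It covers only the similarity-relation branch (a softmax-normalized affinity matrix built from pairwise dot products of node features, followed by Z = AXW); the spatial-temporal relation graph is omitted, and all names (`similarity_adjacency`, `gcn_layer`), dimensions, and the single-layer setup are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def similarity_adjacency(feats):
    # feats: (N, d) features pooled from N object region proposals.
    # Pairwise dot-product affinities, softmax-normalized per row so each
    # node aggregates a convex combination over all other nodes.
    sim = feats @ feats.t()          # (N, N) affinity matrix
    return F.softmax(sim, dim=1)

def gcn_layer(feats, adj, weight):
    # One graph-convolution step: Z = A X W, followed by a nonlinearity.
    return F.relu(adj @ feats @ weight)

# Toy usage: 10 region nodes with 256-d features, one GCN layer.
N, d = 10, 256
feats = torch.randn(N, d)            # stand-in for pooled region features
weight = torch.randn(d, d) * 0.01    # learnable layer weight (hypothetical init)
adj = similarity_adjacency(feats)
out = gcn_layer(feats, adj, weight)  # (10, 256) updated node features
```

In the full model, updated node features from graphs like this would be aggregated (e.g., averaged) and combined with the backbone's global video features for classification; the sketch above only illustrates the message-passing mechanics.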