Real-world scenarios often require anticipating object interactions in an unknown future, which can assist the decision-making of both humans and agents. To meet this challenge, we present a new task, Visual Relationship Forecasting (VRF) in videos, to explore the prediction of visual relationships in a reasoning manner. Specifically, given a subject-object pair with H observed frames, VRF aims to predict their interactions over the next T frames without visual evidence. To evaluate the VRF task, we introduce two video datasets, VRF-AG and VRF-VidOR, each providing a series of spatio-temporally localized visual relation annotations per video. These two datasets densely annotate 13 and 35 visual relationships in 1,923 and 13,447 video clips, respectively. In addition, we present a novel Graph Convolutional Transformer (GCT) framework, which captures both object-level and frame-level dependencies through a spatio-temporal Graph Convolutional Network and a Transformer. Experimental results on both the VRF-AG and VRF-VidOR datasets demonstrate that GCT outperforms state-of-the-art sequence modelling methods on visual relationship forecasting.
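To make the two-stage design concrete, the following is a minimal sketch, assuming PyTorch, of the general idea behind a GCT-style model: a per-frame graph convolution over object nodes captures object-level dependencies, a Transformer encoder over the resulting frame embeddings captures frame-level dependencies, and per-step classifiers forecast relations for the next T frames. All module names, dimensions, and the pooling and decoding choices are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch of a GCT-style forecaster: GCN over object nodes per frame,
# Transformer over frame embeddings, relation logits for each of the next T frames.
import torch
import torch.nn as nn


class GCTSketch(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=256, num_relations=13, horizon=5):
        super().__init__()
        self.gcn = nn.Linear(feat_dim, hidden_dim)  # shared node transform (assumed)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # one relation classifier per future step (horizon = T)
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_relations) for _ in range(horizon)]
        )

    def forward(self, node_feats, adj):
        # node_feats: (B, H, N, feat_dim) object features for H observed frames
        # adj:        (B, H, N, N) per-frame adjacency among the N object nodes
        h = torch.relu(self.gcn(adj @ node_feats))  # object-level message passing
        frame_emb = h.mean(dim=2)                   # (B, H, hidden_dim) pool over nodes
        ctx = self.encoder(frame_emb)               # frame-level dependencies
        summary = ctx[:, -1]                        # last observed frame as context
        return torch.stack([head(summary) for head in self.heads], dim=1)  # (B, T, R)


if __name__ == "__main__":
    model = GCTSketch()
    x = torch.randn(2, 8, 4, 256)       # 2 clips, H=8 observed frames, 4 object nodes
    adj = torch.ones(2, 8, 4, 4) / 4.0  # uniform adjacency, for illustration only
    print(model(x, adj).shape)          # torch.Size([2, 5, 13]) -> T=5 steps, 13 relations
```

In this sketch the relation vocabulary size (13) mirrors the VRF-AG annotation count mentioned above, and the last observed frame embedding is used as the forecasting context; both are placeholder choices rather than details taken from the paper.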