Dynamic scene graph generation aims to generate a scene graph for a given video. Compared to scene graph generation from images, it is more challenging because the dynamic relationships between objects and the temporal dependencies between frames allow for a richer semantic interpretation. In this paper, we propose the Spatial-Temporal Transformer (STTran), a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within the frame, and (2) a temporal decoder that takes the output of the spatial encoder as input to capture the temporal dependencies between frames and infer the dynamic relationships. Furthermore, STTran can flexibly take videos of varying length as input without clipping, which is especially important for long videos. Our method is validated on the benchmark dataset Action Genome (AG). The experimental results demonstrate the superior performance of our method on dynamic scene graph generation. Moreover, a set of ablation studies is conducted to justify the effect of each proposed module.
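To make the two-module design concrete, the following is a minimal PyTorch sketch of a spatial encoder attending over the relationship representations within each frame, followed by a temporal stage attending across frames. It is an illustration under assumptions, not the paper's implementation: feature dimension, head counts, class count, and the use of standard TransformerEncoder layers for both stages (in place of the paper's temporal decoder) are simplifications chosen for brevity.

```python
# Minimal sketch of the spatial-then-temporal attention scheme described above.
# All hyperparameters and tensor shapes are illustrative assumptions.
import torch
import torch.nn as nn


class STTranSketch(nn.Module):
    def __init__(self, feat_dim=512, num_heads=8, num_rel_classes=26):
        super().__init__()
        # Spatial stage: self-attention over the subject-object pair
        # representations of a single frame (within-frame context).
        enc_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(enc_layer, num_layers=1)
        # Temporal stage: attention across frames to capture temporal
        # dependencies between the spatially encoded representations.
        # (A plain encoder layer is used here as a simplification.)
        tmp_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.temporal_stage = nn.TransformerEncoder(tmp_layer, num_layers=1)
        self.rel_classifier = nn.Linear(feat_dim, num_rel_classes)

    def forward(self, rel_feats):
        # rel_feats: (num_frames, num_pairs, feat_dim) relationship features
        # for each subject-object pair in each frame.
        spatial = self.spatial_encoder(rel_feats)      # attend within each frame
        temporal_in = spatial.transpose(0, 1)          # (num_pairs, num_frames, feat_dim)
        temporal = self.temporal_stage(temporal_in)    # attend across frames
        return self.rel_classifier(temporal.transpose(0, 1))


# Example: a clip of 10 frames with 5 subject-object pairs per frame.
model = STTranSketch()
logits = model(torch.randn(10, 5, 512))
print(logits.shape)  # torch.Size([10, 5, 26])
```

Because the temporal stage operates on whole frame sequences via attention, the sketch also reflects why variable-length videos can be processed without clipping.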