Graph Neural Networks are well suited to capturing latent interactions between entities in the spatio-temporal domain (e.g. videos). However, when an explicit structure is not available, it is not obvious which atomic elements should be represented as nodes. Current works generally use pre-trained object detectors or fixed, predefined regions to extract graph nodes. Improving upon this, our proposed model learns nodes that dynamically attach to well-delimited salient regions relevant to a higher-level task, without using any object-level supervision. Constructing these localized, adaptive nodes gives our model an inductive bias towards object-centric representations, and we show that it discovers regions that are well correlated with objects in the video. In extensive ablation studies and experiments on two challenging datasets, we show superior performance over previous graph neural network models for video classification.