Graph neural networks have been shown to learn effective node representations, enabling node-, link-, and graph-level inference. Conventional graph networks assume static relations between nodes, whereas relations between entities in a video often evolve over time, with nodes entering and exiting dynamically. In such temporally-dynamic graphs, a core problem is inferring the future state of spatio-temporal edges, which can represent multiple types of relations. To address this problem, we propose MTD-GNN, a graph network for predicting temporally-dynamic edges for multiple types of relations. We introduce a factorized spatio-temporal graph attention layer to learn dynamic node representations and present a multi-task edge prediction loss that models multiple relations simultaneously. The proposed architecture operates on top of scene graphs that we obtain from videos through object detection and spatio-temporal linking. Experimental evaluations on ActionGenome and CLEVRER show that modeling multiple relations in our temporally-dynamic graph network can be mutually beneficial, outperforming existing static and spatio-temporal graph neural networks as well as state-of-the-art predicate classification methods.