We propose a video feature representation learning framework called STAR-GNN, which applies a pluggable graph neural network component to a multi-scale lattice feature graph. The essence of STAR-GNN is to exploit both the temporal dynamics and the spatial content of a video, together with the visual connections between regions at different scales within frames. It models a video as a lattice feature graph in which nodes represent regions of different granularity and weighted edges represent spatial and temporal links. Contextual nodes are aggregated simultaneously by graph neural networks whose parameters are trained with a retrieval triplet loss. In the experiments, we show that STAR-GNN effectively implements a dynamic attention mechanism over video frame sequences, emphasizing dynamic and semantically rich content in the video while remaining robust to noise and redundancy. Empirical results show that STAR-GNN achieves state-of-the-art performance for Content-Based Video Retrieval.
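To make the described pipeline concrete, the following is a minimal, illustrative sketch of the idea, not the authors' implementation: each frame contributes a whole-frame node plus four quadrant nodes (an assumed region layout), spatial edges connect regions within a frame, temporal edges connect corresponding regions in consecutive frames, a single GCN-style propagation layer with learned node attention aggregates context into one video descriptor, and training uses a standard triplet margin loss. All dimensions, edge weights, and the random stand-in features are assumptions for illustration.

```python
# Illustrative sketch of a STAR-GNN-style pipeline (assumed details, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

REGIONS_PER_FRAME = 5  # 1 whole-frame node + 4 quadrant nodes (assumed layout)

def build_lattice_graph(num_frames: int) -> torch.Tensor:
    """Weighted adjacency: spatial edges inside a frame, temporal edges across frames."""
    n = num_frames * REGIONS_PER_FRAME
    adj = torch.zeros(n, n)
    for t in range(num_frames):
        base = t * REGIONS_PER_FRAME
        # spatial links: whole-frame node <-> each quadrant node
        for q in range(1, REGIONS_PER_FRAME):
            adj[base, base + q] = adj[base + q, base] = 1.0
        # temporal links: same region in consecutive frames (assumed weight 0.5)
        if t + 1 < num_frames:
            nxt = (t + 1) * REGIONS_PER_FRAME
            for r in range(REGIONS_PER_FRAME):
                adj[base + r, nxt + r] = adj[nxt + r, base + r] = 0.5
    adj += torch.eye(n)                # self-loops
    return adj / adj.sum(dim=1, keepdim=True)  # row-normalised propagation matrix

class StarGNNSketch(nn.Module):
    """One propagation layer plus attention-style pooling into a single video descriptor."""
    def __init__(self, in_dim: int = 512, hid_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim)
        self.att = nn.Linear(hid_dim, 1)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        h = F.relu(self.proj(adj @ node_feats))        # aggregate contextual nodes
        w = torch.softmax(self.att(h), dim=0)          # learned per-node attention
        return F.normalize((w * h).sum(dim=0), dim=0)  # pooled video embedding

model = StarGNNSketch()
loss_fn = nn.TripletMarginLoss(margin=0.5)

# Toy anchor / positive / negative videos of 8 frames; random features stand in
# for region descriptors that would normally come from a pre-trained CNN.
adj = build_lattice_graph(num_frames=8)
feats = [torch.randn(8 * REGIONS_PER_FRAME, 512) for _ in range(3)]
anchor, pos, neg = (model(f, adj) for f in feats)
loss = loss_fn(anchor.unsqueeze(0), pos.unsqueeze(0), neg.unsqueeze(0))
loss.backward()
```

The retrieval triplet loss pulls embeddings of matching videos together and pushes non-matching ones apart, which is what drives the attention weights toward dynamic and semantically rich regions in this sketch.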