In this paper, a pure-attention bottom-up approach, called ViGAT, is proposed: it utilizes an object detector together with a Vision Transformer (ViT) backbone network to derive object and frame features, and a head network to process these features for the task of event recognition and explanation in video. The ViGAT head consists of graph attention network (GAT) blocks factorized along the spatial and temporal dimensions in order to effectively capture both local and long-term dependencies between objects or frames. Moreover, using the weighted in-degrees (WiDs) derived from the adjacency matrices at the various GAT blocks, we show that the proposed architecture can identify the most salient objects and frames that explain the network's decision. A comprehensive evaluation study is performed, demonstrating that the proposed approach achieves state-of-the-art results on three large, publicly available video datasets (FCVID, Mini-Kinetics, ActivityNet).
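To illustrate the WiD-based explanation idea, the following is a minimal sketch (not the authors' implementation) of how weighted in-degrees could be computed from the attention-derived adjacency matrix of a GAT block to rank frames (or objects) by saliency; the matrix shapes and names here are illustrative assumptions.

```python
import numpy as np

def weighted_in_degrees(adjacency: np.ndarray) -> np.ndarray:
    """Weighted in-degree of node j: the sum of edge weights entering j,
    i.e. the column sums of the adjacency matrix."""
    return adjacency.sum(axis=0)

# Hypothetical adjacency matrix produced by a GAT block's attention
# mechanism over N = 5 video frames (rows sum to 1 after softmax).
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 5))
adjacency = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

wids = weighted_in_degrees(adjacency)
# Nodes receiving the most attention mass are treated as the most
# salient, i.e. those that best explain the event decision.
salient_order = np.argsort(wids)[::-1]
print("WiDs:", np.round(wids, 3))
print("Frames ranked by saliency:", salient_order)
```

Under this reading, explanation reduces to inspecting quantities the network already computes (the attention weights), so no separate explanation model is needed.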