In the field of action recognition, video clips are typically treated as ordered frame sequences for subsequent processing. To achieve spatio-temporal perception, existing approaches embed adjacent-frame temporal interactions into convolutional layers, so that global semantic information is obtained by hierarchically stacking multiple such local layers. However, this gradual temporal accumulation only captures high-level semantics in deep layers, neglecting potential low-level holistic cues in shallow layers. In this paper, we propose to transform a video sequence into a graph so as to obtain direct long-term dependencies among temporal frames. To preserve sequential information during this transformation, we devise a structured graph module (SGM) that achieves fine-grained temporal interaction throughout the entire network. In particular, SGM divides the neighbors of each node into several temporal regions so as to extract global structural information through diverse sequential flows. Extensive experiments are performed on standard benchmark datasets, i.e., Something-Something V1 & V2, Diving48, Kinetics-400, UCF101, and HMDB51. The reported results and analysis demonstrate that SGM achieves outstanding accuracy with lower computational complexity.
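As a rough illustration of the idea described above, the sketch below (not the authors' code; region count, feature sizes, and the mean-aggregation rule are illustrative assumptions) treats per-frame features as graph nodes, splits each node's neighbors into ordered temporal regions (past, current, future), and aggregates each region with its own projection so that sequential order is not lost when the graph is built.

```python
# Minimal sketch of a structured graph over frame features, assuming three
# temporal regions (past / self / future) and mean aggregation per region.
import torch
import torch.nn as nn


class StructuredGraphSketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # One projection per temporal region, so different sequential flows
        # (earlier frames vs. later frames) are aggregated with distinct weights.
        self.region_proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim) -- per-frame features treated as graph nodes.
        t = x.size(1)
        pos = torch.arange(t, device=x.device)
        # Signed temporal offset of neighbor j relative to node i.
        offset = pos.view(1, t) - pos.view(t, 1)           # (T, T)
        region_id = torch.sign(offset) + 1                 # 0: past, 1: self, 2: future
        out = torch.zeros_like(x)
        for r, proj in enumerate(self.region_proj):
            adj = (region_id == r).float()                   # adjacency of region r
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # avoid divide-by-zero at clip ends
            agg = torch.einsum('ij,bjd->bid', adj / deg, x)  # mean over region-r neighbors
            out = out + proj(agg)
        return out


if __name__ == "__main__":
    frames = torch.randn(2, 8, 64)                   # 2 clips, 8 frames, 64-dim features
    print(StructuredGraphSketch(64)(frames).shape)   # torch.Size([2, 8, 64])
```

Because every frame attends to all other frames in one step, long-term dependencies are direct rather than accumulated through stacked local layers, while the per-region projections keep the temporal ordering explicit.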