Graph convolutional networks have been widely used for skeleton-based action recognition due to their strong ability to model non-Euclidean data. Since graph convolution is a local operation, it can only exploit short-range joint dependencies and short-term trajectories, and fails to directly model the relations between distant joints and the long-range temporal information that are vital for distinguishing various actions. To solve this problem, we present a multi-scale spatial graph convolution (MS-GC) module and a multi-scale temporal graph convolution (MT-GC) module to enlarge the receptive field of the model in the spatial and temporal dimensions. Concretely, the MS-GC and MT-GC modules decompose the corresponding local graph convolution into a set of sub-graph convolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features are processed by a series of sub-graph convolutions, and each node can complete multiple spatial and temporal aggregations with its neighborhoods. The equivalent receptive field is accordingly enlarged, enabling the model to capture both short- and long-range dependencies in the spatial and temporal domains. By coupling these two modules into a basic block, we further propose a multi-scale spatial temporal graph convolutional network (MST-GCN), which stacks multiple blocks to learn effective motion representations for action recognition. The proposed MST-GCN achieves remarkable performance on three challenging benchmark datasets, NTU RGB+D, NTU-120 RGB+D and Kinetics-Skeleton, for skeleton-based action recognition.
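The following is a minimal PyTorch sketch of the hierarchical residual idea described above, illustrated for the temporal branch (MT-GC): the channels are split into groups, each group (except the first) passes through its own sub-convolution, and each group additionally receives the output of the previous group, so later groups see a progressively larger temporal receptive field without extra parameters. Names such as `MTGraphConv`, `num_scales`, and the 3x1 temporal kernel are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class MTGraphConv(nn.Module):
    """Hierarchical residual multi-scale temporal convolution (sketch)."""

    def __init__(self, channels: int, num_scales: int = 4, t_kernel: int = 3):
        super().__init__()
        assert channels % num_scales == 0
        self.num_scales = num_scales
        width = channels // num_scales
        pad = (t_kernel - 1) // 2
        # One temporal sub-convolution per scale except the first split,
        # which is passed through unchanged (identity branch).
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=(t_kernel, 1), padding=(pad, 0))
            for _ in range(num_scales - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Input shape (N, C, T, V): batch, channels, frames, joints.
        splits = torch.chunk(x, self.num_scales, dim=1)
        outputs = [splits[0]]  # first group: identity, keeps short-range detail
        prev = None
        for i, conv in enumerate(self.convs):
            # Hierarchical residual connection: each group also receives the
            # output of the previous group, so its features have already been
            # convolved i times, i.e. an enlarged equivalent receptive field,
            # without introducing additional parameters.
            inp = splits[i + 1] if prev is None else splits[i + 1] + prev
            prev = conv(inp)
            outputs.append(prev)
        return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    # Example: 64-channel features over 64 frames and 25 joints (NTU layout).
    feats = torch.randn(2, 64, 64, 25)
    module = MTGraphConv(channels=64, num_scales=4)
    print(module(feats).shape)  # torch.Size([2, 64, 64, 25])
```

The spatial branch (MS-GC) follows the same splitting scheme, with each sub-convolution operating over the skeleton graph's adjacency instead of the temporal axis.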