Skeleton-based Human Activity Recognition has attracted great interest in recent years, as skeleton data have proven robust to illumination changes, body scales, dynamic camera views, and complex backgrounds. In particular, Spatial-Temporal Graph Convolutional Networks (ST-GCN) have proven effective in learning both spatial and temporal dependencies on non-Euclidean data such as skeleton graphs. Nevertheless, effectively encoding the latent information underlying the 3D skeleton is still an open problem, especially when it comes to extracting effective information from joint motion patterns and their correlations. In this work, we propose a novel Spatial-Temporal Transformer network (ST-TR), which models dependencies between joints using the Transformer self-attention operator. In our ST-TR model, a Spatial Self-Attention module (SSA) is used to capture intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) to model inter-frame correlations. The two are combined in a two-stream network, whose performance is evaluated on three large-scale datasets, NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics Skeleton 400, consistently improving over the backbone results. Compared with methods that use the same input data, the proposed ST-TR achieves state-of-the-art performance on all datasets when using joints' coordinates as input, and results on par with the state of the art when bone information is added.
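To give a concrete sense of the intra-frame attention the SSA module performs, the following is a minimal NumPy sketch of single-head self-attention over the joints of one skeleton frame. All weights here are random and all dimensions (e.g. `d_k`, a 25-joint skeleton) are illustrative assumptions for exposition, not the authors' implementation, which learns these projections end-to-end.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(frame, d_k=8, seed=0):
    """Illustrative single-head self-attention over one frame's joints.

    frame: (V, C) array -- V joints, C channels (e.g. 3D coordinates).
    Returns the attended joint features and the (V, V) joint-to-joint
    attention map. Projections are random here; a trained SSA learns them.
    """
    V, C = frame.shape
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((C, d_k)) / np.sqrt(C)  # query projection
    Wk = rng.standard_normal((C, d_k)) / np.sqrt(C)  # key projection
    Wv = rng.standard_normal((C, d_k)) / np.sqrt(C)  # value projection
    Q, K, Vv = frame @ Wq, frame @ Wk, frame @ Wv
    # Scaled dot-product attention: each joint attends to every other joint,
    # independently of the skeleton's fixed graph topology.
    attn = softmax(Q @ K.T / np.sqrt(d_k))
    return attn @ Vv, attn

# A 25-joint skeleton with 3D coordinates (NTU-style joint count).
joints = np.random.default_rng(1).standard_normal((25, 3))
out, attn = spatial_self_attention(joints)
print(out.shape, attn.shape)  # (25, 8) (25, 25)
```

The TSA module is the analogous operation along the other axis: for each joint, attention is computed across its positions in all frames, capturing inter-frame motion correlations.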