We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques, which require multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The model uses a hierarchical transformer with intra-frame offset attention and inter-frame self-attention. The attention mechanism allows the model to freely attend between any two vertex patches to learn non-local relationships in the spatial-temporal domain. Masked vertex modeling and future frame prediction are used as two self-supervised tasks to fully activate the bi-directional and auto-regressive attention in our hierarchical transformer. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is available at https://github.com/zgzxy001/STMT.
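To make the attention design concrete, below is a minimal PyTorch sketch of one hierarchical stage: offset attention applied within each frame, followed by plain self-attention across frames. All module and parameter names (`OffsetAttention`, `SpatialTemporalBlock`, `dim`, `heads`) are illustrative assumptions, and the offset-attention form follows the common PCT-style formulation `x + LBR(x - SA(x))`; this is a sketch of the described mechanism, not code from the linked repository.

```python
import torch
import torch.nn as nn


class OffsetAttention(nn.Module):
    """Intra-frame offset attention (PCT-style sketch): the offset between
    the input tokens and their self-attention output is transformed by an
    LBR-like layer (here linear + LayerNorm + ReLU) and added back as a
    residual: x + LBR(x - SA(x))."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.lbr = nn.Sequential(nn.Linear(dim, dim), nn.LayerNorm(dim), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B*T, P, C) — vertex-patch tokens of each frame
        scale = x.shape[-1] ** -0.5
        attn = torch.softmax(self.q(x) @ self.k(x).transpose(-2, -1) * scale, dim=-1)
        xa = attn @ self.v(x)        # attended features SA(x)
        return x + self.lbr(x - xa)  # offset residual


class SpatialTemporalBlock(nn.Module):
    """One hierarchical stage: offset attention within each frame (spatial),
    then self-attention across frames at each patch position (temporal)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.intra = OffsetAttention(dim)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, P, C) — batch, frames, vertex patches, channels
        B, T, P, C = x.shape
        x = self.intra(x.reshape(B * T, P, C)).reshape(B, T, P, C)
        xt = x.permute(0, 2, 1, 3).reshape(B * P, T, C)  # attend over time
        h = self.norm(xt)
        xt = xt + self.inter(h, h, h, need_weights=False)[0]
        return xt.reshape(B, P, T, C).permute(0, 2, 1, 3)
```

This factorized spatial-then-temporal layout matches the intra-frame/inter-frame split named in the abstract; the self-supervised objectives (masked vertex modeling, future frame prediction) would sit on top of such a backbone as reconstruction and prediction heads.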