We propose a novel transformer model for unsupervised learning of skeleton motion sequences. The existing transformer model for unsupervised skeleton-based action learning learns the instantaneous velocity of each joint from adjacent frames, without global motion information. Consequently, the model has difficulty learning attention globally over whole-body motions and temporally distant joints. In addition, person-to-person interactions have not been considered in the model. To tackle the learning of whole-body motion, long-range temporal dynamics, and person-to-person interactions, we design a global and local attention mechanism in which global body motions and local joint motions attend to each other. In addition, we propose a novel pretraining strategy, multi-interval pose displacement prediction, to learn both global and local attention over diverse time ranges. The proposed model successfully learns the local dynamics of the joints and captures global context from the motion sequences. Our model outperforms state-of-the-art models by notable margins on representative benchmarks. Code is available at https://github.com/Boeun-Kim/GL-Transformer.
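To make the pretraining objective concrete, the sketch below shows one plausible way to build multi-interval pose displacement targets: for each interval k, the target at frame t is the displacement of the pose from frame t-k to frame t, so short intervals capture local joint dynamics and long intervals capture global motion. The interval set, the zero-padding of the first k frames, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def displacement_targets(poses, intervals=(1, 2, 4)):
    """Compute pose displacement targets at multiple time intervals.

    poses: array of shape (T, D), one flattened pose vector per frame.
    Returns a dict mapping each interval k to an array of shape (T, D)
    where row t holds poses[t] - poses[t - k] (zeros for t < k, an
    assumed padding choice for frames with no frame k steps earlier).
    """
    targets = {}
    for k in intervals:
        d = np.zeros_like(poses)
        d[k:] = poses[k:] - poses[:-k]
        targets[k] = d
    return targets
```

During pretraining, the transformer would be asked to predict these targets from the input sequence; using several intervals at once is what forces the attention to cover both adjacent-frame velocity and longer-range motion.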