We propose a novel Transformer-based architecture for the task of generative modelling of 3D human motion. Previous work commonly relies on RNN-based models that consider shorter forecast horizons and quickly converge to a stationary, often implausible, state. Recent studies show that implicit temporal representations in the frequency domain are also effective at making predictions for a predetermined horizon. Our focus lies on learning spatio-temporal representations autoregressively, and hence on generating plausible future motion over both short and long horizons. The proposed model learns high-dimensional embeddings for skeletal joints and how to compose a temporally coherent pose via a decoupled temporal and spatial self-attention mechanism. Our dual-attention concept allows the model to access current and past information directly and to capture both structural and temporal dependencies explicitly. We show empirically that this effectively learns the underlying motion dynamics and reduces the error accumulation over time observed in autoregressive models. Our model makes accurate short-term predictions and generates plausible motion sequences over long horizons. We make our code publicly available at https://github.com/eth-ait/motion-transformer.
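The decoupled temporal and spatial self-attention described above can be illustrated with a minimal sketch. This is a simplified, hypothetical NumPy implementation (single head, no learned projections, no positional encodings), not the paper's actual architecture: temporal attention lets each joint attend over its own history with a causal mask, so only current and past frames are visible, while spatial attention lets the joints within one pose attend to each other.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, causal=False):
    # x: (..., N, D); single-head scaled dot-product self-attention.
    # For illustration, queries/keys/values are the inputs themselves
    # (the real model would use learned projection matrices).
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    if causal:
        # Mask out future positions so each step sees only current and past.
        n = x.shape[-2]
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -1e9)
    return softmax(scores) @ x

def decoupled_attention(seq):
    # seq: (T, J, D) — T timesteps, J joints, D-dim joint embeddings.
    # Temporal branch: each joint attends over its own trajectory (axis T).
    temporal = self_attention(np.swapaxes(seq, 0, 1), causal=True)
    temporal = np.swapaxes(temporal, 0, 1)            # back to (T, J, D)
    # Spatial branch: joints within each pose attend to each other (axis J).
    spatial = self_attention(seq)
    # Hypothetical fusion by summation; the actual model may combine differently.
    return temporal + spatial
```

Because the temporal branch is causally masked and the spatial branch operates within a single frame, the output at time step t depends only on frames up to t, which is what permits autoregressive generation.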