We consider the problem of synthesizing multi-action human motion sequences of arbitrary length. Existing approaches have mastered motion sequence generation in single-action scenarios, but fail to generalize to multi-action and arbitrary-length sequences. We fill this gap by proposing a novel, efficient approach that leverages the expressiveness of Recurrent Transformers and the generative richness of conditional Variational Autoencoders. The proposed iterative approach generates smooth and realistic human motion sequences with an arbitrary number of actions and frames while requiring only linear space and time. We train and evaluate the proposed approach on the PROX and Charades datasets, where we augment PROX with ground-truth action labels and Charades with human mesh annotations. Experimental evaluation shows significant improvements in FID score and semantic consistency metrics compared to the state-of-the-art.
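To make the iterative, linear-cost generation scheme described above concrete, the following is a minimal sketch of one plausible realization in PyTorch. It is an illustration under assumed design choices, not the authors' actual architecture: the module name `RecurrentTransformerCVAE`, the fixed-size memory tokens, and all tensor dimensions are hypothetical. The key idea it demonstrates is that each action segment is decoded from a CVAE latent sample conditioned on the action label and on a compact recurrent summary of previously generated frames, so cost grows linearly with the number of actions and frames.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of an iterative Recurrent-Transformer + conditional-VAE
# generator. A fixed-size set of memory tokens summarizes all previous frames,
# so space and time grow linearly with the number of generated frames.

class RecurrentTransformerCVAE(nn.Module):
    def __init__(self, pose_dim=63, latent_dim=256, n_actions=40, mem_tokens=16):
        super().__init__()
        self.action_emb = nn.Embedding(n_actions, latent_dim)
        layer = nn.TransformerEncoderLayer(latent_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.memory = nn.Parameter(torch.zeros(1, mem_tokens, latent_dim))
        self.to_pose = nn.Linear(latent_dim, pose_dim)

    def decode_segment(self, z, action_id, memory, n_frames):
        # Condition on the latent sample, the action embedding, and the
        # recurrent memory tokens carried over from previous segments.
        act = self.action_emb(action_id).unsqueeze(1)               # (B, 1, D)
        queries = z.unsqueeze(1).expand(-1, n_frames, -1)           # (B, T, D)
        tokens = torch.cat([memory, act, queries], dim=1)
        out = self.encoder(tokens)
        new_memory = out[:, :memory.size(1)]                        # updated summary
        poses = self.to_pose(out[:, memory.size(1) + 1:])           # (B, T, pose_dim)
        return poses, new_memory

    @torch.no_grad()
    def generate(self, action_ids, frames_per_action=60):
        # Roll out one action at a time; the memory keeps context at fixed size.
        B = action_ids.size(0)
        memory = self.memory.expand(B, -1, -1)
        segments = []
        for step in range(action_ids.size(1)):
            z = torch.randn(B, self.action_emb.embedding_dim)       # CVAE prior sample
            poses, memory = self.decode_segment(
                z, action_ids[:, step], memory, frames_per_action)
            segments.append(poses)
        return torch.cat(segments, dim=1)                           # (B, A*T, pose_dim)
```

As a usage example, `RecurrentTransformerCVAE().generate(torch.tensor([[3, 7, 12]]))` would produce a single sequence covering three consecutive (hypothetical) action labels; because each step attends only to the fixed-size memory rather than the full history, sequence length is unbounded in principle.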