Text-to-motion generation is an emerging and challenging problem that aims to synthesize motion whose semantics match the input text. However, due to the lack of diverse labeled training data, most approaches are either limited to specific types of text annotations or require online optimization to adapt to unseen texts at inference time, at the cost of efficiency and stability. In this paper, we investigate offline open-vocabulary text-to-motion generation in a zero-shot learning manner that requires neither paired training data nor extra online optimization to adapt to unseen texts. Inspired by prompt learning in NLP, we pretrain a motion generator that learns to reconstruct the full motion from a masked motion. During inference, instead of changing the motion generator, our method reformulates the input text into a masked motion that serves as the prompt for the motion generator to ``reconstruct'' the motion. In constructing the prompt, the unmasked poses are synthesized by a text-to-pose generator. To supervise the optimization of the text-to-pose generator, we propose the first text-pose alignment model for measuring the alignment between texts and 3D poses. To prevent the pose generator from overfitting to the limited training texts, we further propose a novel wordless training mechanism that optimizes the text-to-pose generator without any training texts. Comprehensive experimental results show that our method significantly outperforms the baseline methods. The code is available.
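To make the masked-motion prompting pipeline concrete, the following is a minimal Python (PyTorch) sketch of the inference procedure described above. All names here (`text_encoder`, `text_to_pose`, `motion_generator`, `keyframe_ids`) are hypothetical placeholders for illustration, not the paper's actual API; the mask layout and number of unmasked frames are assumptions.

```python
import torch

# Hypothetical modules (names are illustrative, not the paper's actual API):
#   text_encoder:     maps a text string to an embedding (e.g., a CLIP text encoder)
#   text_to_pose:     pose generator supervised by the text-pose alignment model
#   motion_generator: frozen generator pretrained to reconstruct full motion
#                     from masked motion

def generate_motion(text, text_encoder, text_to_pose, motion_generator,
                    num_frames=60, keyframe_ids=(0, 30, 59)):
    """Zero-shot text-to-motion inference via masked-motion prompting."""
    # 1. Synthesize a pose aligned with the text (no online optimization).
    text_emb = text_encoder(text)        # (1, d)
    pose = text_to_pose(text_emb)        # (1, pose_dim)

    # 2. Build the prompt: a mostly masked motion whose unmasked
    #    frames are filled with the synthesized pose.
    prompt = torch.zeros(1, num_frames, pose.shape[-1])
    mask = torch.zeros(1, num_frames, dtype=torch.bool)
    for t in keyframe_ids:
        prompt[:, t] = pose              # place the pose at unmasked frames
        mask[:, t] = True                # mark these frames as observed

    # 3. The frozen motion generator "reconstructs" the full motion
    #    conditioned on the prompt, exactly as in its pretraining task.
    motion = motion_generator(prompt, mask)   # (1, num_frames, pose_dim)
    return motion
```

Because only the prompt is text-dependent, the motion generator itself stays fixed at inference, which is what makes the method offline and stable for unseen texts.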