TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts

Inspired by the strong ties between vision and language, two closely linked modalities of human sensing and communication, our paper explores the generation of 3D human full-body motions from texts, as well as its reciprocal task, abbreviated as text2motion and motion2text, respectively. To tackle the existing challenges, especially to enable the generation of multiple distinct motions from the same text and to avoid the undesirable production of trivial motionless pose sequences, we propose the use of motion tokens, a discrete and compact motion representation. This puts motions and texts on a level playing field, since both signals are then represented as tokens: motion tokens and text tokens, respectively. Moreover, our motion2text module is integrated into the inverse alignment process of our text2motion training pipeline, where a significant deviation of the synthesized text from the input text is penalized by a large training loss; empirically, this is shown to improve performance. Finally, the mappings between the two modalities of motions and texts are facilitated by adapting neural machine translation (NMT) models to our context. This autoregressive modeling of the distribution over discrete motion tokens further enables non-deterministic production of pose sequences, of variable lengths, from an input text. Our approach is flexible and can be used for both text2motion and motion2text tasks. Empirical evaluations on two benchmark datasets demonstrate the superior performance of our approach on both tasks over a variety of state-of-the-art methods. Project page: https://ericguo5513.github.io/TM2T/
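To make the two central ideas above concrete, the following is a minimal sketch, not the paper's implementation. It assumes that motion tokens come from a nearest-neighbour lookup in a codebook (the abstract does not say how tokens are obtained) and that a trained model supplies next-token logits. The names `quantize`, `sample_tokens`, and `toy_logits`, the toy codebook, and the uniform logits are all hypothetical stand-ins chosen for illustration.

```python
import math
import random


def quantize(frames, codebook):
    """Map each pose feature vector to the index of its nearest codebook entry."""
    tokens = []
    for frame in frames:
        dists = [sum((a - b) ** 2 for a, b in zip(frame, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_tokens(next_logits, end_token, max_len, rng):
    """Sample motion tokens one at a time until end_token or max_len is reached."""
    tokens = []
    while len(tokens) < max_len:
        probs = softmax(next_logits(tokens))
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if token == end_token:
            break
        tokens.append(token)
    return tokens


# Toy codebook with three 2-D entries (assumption: real codebooks are learned).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.1, 0.8]]
motion_tokens = quantize(frames, codebook)


def toy_logits(prefix):
    # Hypothetical stand-in for a trained, text-conditioned model:
    # uniform scores over three motion tokens plus an END token (index 3).
    return [0.0, 0.0, 0.0, 0.0]


rng = random.Random(0)
samples = [sample_tokens(toy_logits, end_token=3, max_len=8, rng=rng) for _ in range(5)]
```

Because each token is drawn from a distribution rather than chosen by argmax, repeated calls with the same conditioning can yield different sequences, and because sampling stops at the END token, the sequences can differ in length. This is the sense in which the abstract describes generation as non-deterministic and of variable length.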
