Our goal is to synthesize 3D human motions given textual inputs describing simultaneous actions, for example 'waving hand' while 'walking' at the same time. We refer to generating such simultaneous movements as performing 'spatial compositions'. In contrast to temporal compositions, which seek to transition from one action to another, spatial composition requires understanding which body parts are involved in which action, in order to move them simultaneously. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as "what are the body parts involved in the action <action name>?", while also providing the list of body parts and few-shot examples. Given this action-part mapping, we combine body parts from two motions together and establish the first automated method to spatially compose two actions. However, training data with compositional actions is always limited by the combinatorics of action pairs. Hence, we further create synthetic data with this approach, and use it to train a new state-of-the-art text-to-motion generation model, called SINC ("SImultaneous actioN Compositions for 3D human motions"). In our experiments, we find that training on additional synthetic GPT-guided compositional motions improves text-to-motion generation.
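The two-step idea described above — querying a language model for the body parts an action involves, then stitching two motions together part by part — can be sketched as follows. This is a minimal illustration under assumed names and data layouts (the part list, `build_prompt`, and `compose` are hypothetical, not the paper's actual code), and motions are represented simply as per-part labels rather than real pose sequences:

```python
# Illustrative sketch of GPT-guided spatial composition (assumed, simplified).

# Hypothetical coarse body-part list provided to the language model.
BODY_PARTS = ["left arm", "right arm", "left leg", "right leg", "torso", "head"]


def build_prompt(action: str) -> str:
    """Build a few-shot prompt asking which body parts an action involves,
    in the spirit of the query described in the abstract."""
    few_shot = (
        "Q: What are the body parts involved in the action 'kicking'?\n"
        "A: right leg, left leg, torso\n"
    )
    return (
        f"Answer using only parts from this list: {', '.join(BODY_PARTS)}.\n"
        + few_shot
        + f"Q: What are the body parts involved in the action '{action}'?\nA:"
    )


def compose(motion_a: dict, parts_a: list, motion_b: dict) -> dict:
    """Spatially compose two motions: start from motion B for all parts,
    then overwrite the parts that the mapping assigns to action A."""
    composed = dict(motion_b)
    for part in parts_a:
        composed[part] = motion_a[part]
    return composed


# Toy example: 'waving hand' (right arm, per the GPT-derived mapping)
# layered on top of 'walking' (whole body).
wave = {part: "wave_pose" for part in BODY_PARTS}
walk = {part: "walk_pose" for part in BODY_PARTS}
combined = compose(wave, ["right arm"], walk)
```

In a real pipeline the prompt would be sent to the language model and its answer parsed into the part list, and `compose` would operate on per-frame joint rotations rather than string labels; only the division of labor between the two steps is faithful to the abstract.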