Text-based motion generation models are attracting growing interest for their potential to automate motion creation in the game, animation, and robotics industries. In this paper, we propose FLAME, a diffusion-based model for motion synthesis and editing. Inspired by the recent success of diffusion models, we bring diffusion-based generative modeling to the motion domain. FLAME can generate high-fidelity motions that are well aligned with the given text. It can also edit parts of a motion, both frame-wise and joint-wise, without any fine-tuning. FLAME uses a new transformer-based architecture that we devise to better handle motion data; we find it crucial for managing variable-length motions and attending well to free-form text. In experiments, we show that FLAME achieves state-of-the-art generation performance on three text-motion datasets: HumanML3D, BABEL, and KIT. We also demonstrate that the editing capability of FLAME extends to other tasks, such as motion prediction and motion in-betweening, which have previously been handled by dedicated models.