Conventional methods for human motion synthesis are either deterministic or struggle with the trade-off between motion diversity and motion quality. In response to these limitations, we introduce MoFusion, a new denoising-diffusion-based framework for high-quality conditional human motion synthesis that can generate long, temporally plausible, and semantically accurate motions based on a range of conditioning contexts (such as music and text). We also show how to incorporate well-known kinematic losses for motion plausibility into the motion diffusion framework through our scheduled weighting strategy. The learned latent space can be used for several interactive motion-editing applications -- such as in-betweening, seed conditioning, and text-based editing -- thus providing crucial abilities for virtual character animation and robotics. Through comprehensive quantitative evaluations and a perceptual user study, we demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature. We urge the reader to watch our supplementary video and visit https://vcai.mpi-inf.mpg.de/projects/MoFusion.
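To make the scheduled weighting of kinematic losses more concrete, the sketch below shows one way such a scheme could be wired into a diffusion training step. It is a minimal, hypothetical PyTorch illustration: the model interface, the cosine noise schedule, the linear weight schedule, and the bone-length term are all assumptions chosen for readability, not MoFusion's actual losses or hyperparameters.

```python
# Hypothetical sketch: time-scheduled kinematic loss in a diffusion training
# step. All names and schedules are illustrative assumptions, not MoFusion's code.
import torch
import torch.nn.functional as F


def kinematic_weight(t, T):
    """Down-weight kinematic terms at noisy (large-t) steps, where the
    predicted pose is still far from a plausible skeleton (linear schedule,
    an assumption)."""
    return 1.0 - t.float() / T


def bone_length_loss(joints, parents, rest_lengths):
    """Penalise deviation of predicted bone lengths from the rest skeleton.
    joints: (batch, frames, num_joints, 3), parents: LongTensor of parent
    indices per joint, rest_lengths: (num_joints - 1,)."""
    bones = joints[..., 1:, :] - joints[..., parents[1:], :]
    lengths = bones.norm(dim=-1)
    return F.mse_loss(lengths, rest_lengths.expand_as(lengths))


def training_step(model, x0, cond, parents, rest_lengths, T=1000):
    """One denoising-diffusion training step with a scheduled kinematic term."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    noise = torch.randn_like(x0)

    # Toy cosine-style noise schedule (assumption).
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T) ** 2
    a = alpha_bar.view(b, *([1] * (x0.dim() - 1)))
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise

    x0_pred = model(x_t, t, cond)         # model predicts the clean motion
    data_loss = F.mse_loss(x0_pred, x0)   # standard reconstruction term

    w = kinematic_weight(t, T).mean()     # scheduled weighting of the kinematic term
    kin_loss = bone_length_loss(x0_pred, parents, rest_lengths)
    return data_loss + w * kin_loss
```

The intuition behind such a schedule is that kinematic constraints are only informative once the denoised estimate resembles a valid pose, so their contribution grows as the timestep approaches the clean end of the diffusion chain.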