3D human motion generation is crucial for the creative industry. Recent advances rely on generative models with domain knowledge for text-driven motion generation, leading to substantial progress in capturing common motions. However, performance on more diverse motions remains unsatisfactory. In this work, we propose ReMoDiffuse, a diffusion-model-based motion generation framework that integrates a retrieval mechanism to refine the denoising process. ReMoDiffuse enhances the generalizability and diversity of text-driven motion generation with three key designs: 1) Hybrid Retrieval finds appropriate references from the database in terms of both semantic and kinematic similarities. 2) Semantic-Modulated Transformer selectively absorbs retrieval knowledge, adapting to the difference between retrieved samples and the target motion sequence. 3) Condition Mixture better utilizes the retrieval database during inference, overcoming the scale sensitivity of classifier-free guidance. Extensive experiments demonstrate that ReMoDiffuse outperforms state-of-the-art methods by balancing text-motion consistency and motion quality, especially for more diverse motion generation.
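To make the hybrid-retrieval idea concrete, the minimal sketch below scores database motions by combining a semantic term (cosine similarity between text embeddings) with a simple kinematic proxy based on motion-length mismatch. The embedding source, the length-based kinematic term, and the decay parameter `lam` are illustrative assumptions for this sketch, not ReMoDiffuse's exact scoring function.

```python
import numpy as np

def hybrid_retrieval_scores(query_text_emb, db_text_embs, query_length, db_lengths, lam=0.1):
    """Score each database motion for retrieval.

    Combines a semantic score (cosine similarity between the query text
    embedding and each database text embedding) with a kinematic proxy
    (exponential penalty on sequence-length mismatch). The weighting and the
    length-based term are assumptions of this sketch.
    """
    # Semantic similarity: cosine between normalized text embeddings.
    q = query_text_emb / np.linalg.norm(query_text_emb)
    db = db_text_embs / np.linalg.norm(db_text_embs, axis=1, keepdims=True)
    semantic = db @ q
    # Kinematic proxy: prefer database motions whose length matches the target.
    kinematic = np.exp(-lam * np.abs(np.asarray(db_lengths) - query_length))
    return semantic * kinematic

def retrieve_top_k(scores, k=4):
    """Return indices of the k highest-scoring database motions."""
    return np.argsort(scores)[::-1][:k]
```

In a full pipeline, the retrieved motions (and their text descriptions) would then be fed to the denoiser as additional conditions, which is where the semantics-modulated attention and condition mixture of ReMoDiffuse come into play.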