Text-driven motion generation has attracted increasing attention due to its broad applications in virtual reality, animation, and robotics. While existing methods typically model human and animal motion separately, a joint cross-species approach offers key advantages, such as a unified representation and improved generalization. However, morphological differences across species remain a major challenge, often compromising motion plausibility. To address this, we propose X-MoGen, the first unified framework for cross-species text-driven motion generation covering both humans and animals. X-MoGen adopts a two-stage architecture. In the first stage, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by a morphological loss. In the second stage, we perform masked motion modeling to generate motion embeddings conditioned on textual descriptions. During training, a morphological consistency module is employed to promote skeletal plausibility across species. To support unified modeling, we construct UniMo4D, a large-scale dataset of 115 species and 119k motion sequences, which integrates human and animal motions under a shared skeletal topology for joint training. Extensive experiments on UniMo4D demonstrate that X-MoGen outperforms state-of-the-art methods on both seen and unseen species.
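To make the two-stage design concrete, below is a minimal PyTorch sketch assembled purely from the abstract. Everything specific in it is an assumption: the module names (TPoseCVAE, MotionAE, MaskedMotionTransformer, morphological_loss), all dimensions, the MLP/GRU/Transformer backbones, and the bone-length reading of the morphological loss are illustrative stand-ins, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 24 joints, 3-D joint positions, 256-D latents,
# 512-D text features. The paper does not specify these values.
J, D_POSE, D_LAT, D_TXT = 24, 3, 256, 512

class TPoseCVAE(nn.Module):
    """Stage 1a (assumed form): VAE over canonical T-poses. A plain MLP
    stands in for the graph encoder/decoder described in the abstract."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(J * D_POSE, 256), nn.ReLU())
        self.mu = nn.Linear(256, D_LAT)
        self.logvar = nn.Linear(256, D_LAT)
        self.dec = nn.Sequential(nn.Linear(D_LAT, 256), nn.ReLU(),
                                 nn.Linear(256, J * D_POSE))

    def forward(self, tpose):                              # (B, J*D_POSE)
        h = self.enc(tpose)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(z), mu, logvar

class MotionAE(nn.Module):
    """Stage 1b (assumed form): autoencoder mapping motion sequences into
    the shared latent space."""
    def __init__(self):
        super().__init__()
        self.enc = nn.GRU(J * D_POSE, D_LAT, batch_first=True)
        self.dec = nn.GRU(D_LAT, J * D_POSE, batch_first=True)

    def forward(self, motion):                             # (B, T, J*D_POSE)
        z, _ = self.enc(motion)
        recon, _ = self.dec(z)
        return recon, z

def morphological_loss(recon, bone_pairs, bone_lens):
    """One plausible reading of the morphological loss: penalize deviation
    of reconstructed bone lengths from the species' canonical lengths."""
    B, T, _ = recon.shape
    joints = recon.view(B, T, J, D_POSE)
    i, j = bone_pairs[:, 0], bone_pairs[:, 1]
    pred = (joints[:, :, i] - joints[:, :, j]).norm(dim=-1)  # (B, T, bones)
    return (pred - bone_lens[None, None]).abs().mean()

class MaskedMotionTransformer(nn.Module):
    """Stage 2 (assumed form): predict masked latent motion tokens,
    conditioned on a text embedding prepended as an extra token."""
    def __init__(self):
        super().__init__()
        self.text_proj = nn.Linear(D_TXT, D_LAT)
        layer = nn.TransformerEncoderLayer(D_LAT, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, D_LAT))

    def forward(self, z, mask, text_feat):                 # z: (B, T, D_LAT)
        # Replace masked positions with a learned mask token.
        z = torch.where(mask[..., None], self.mask_token.expand_as(z), z)
        cond = self.text_proj(text_feat)[:, None]          # (B, 1, D_LAT)
        out = self.backbone(torch.cat([cond, z], dim=1))
        return out[:, 1:]                                  # predicted latents
```

Prepending the projected text feature as a conditioning token is only one common way to condition a masked-token transformer; the paper may instead use cross-attention or another scheme, and the graph structure of its variational autoencoder and the morphological consistency module are not spelled out here.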