OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions

Large language models (LLMs) have unified diverse linguistic tasks within a single framework, yet comparable unification remains unexplored in human motion generation. Existing methods are confined to isolated tasks, which limits their flexibility for free-form, omni-objective generation. To address this, we propose OmniMoGen, a unified framework that enables versatile motion generation through interleaved text-motion instructions. Built on a concise RVQ-VAE and transformer architecture, OmniMoGen supports end-to-end instruction-driven motion generation. We construct X2Mo, a large-scale dataset of over 137K interleaved text-motion instructions, and introduce AnyContext, a benchmark for evaluating interleaved motion generation. Experiments show that OmniMoGen achieves state-of-the-art performance on text-to-motion, motion editing, and AnyContext, and exhibits emergent capabilities such as compositional editing, self-reflective generation, and knowledge-informed generation. These results mark a step toward the next generation of intelligent motion generation. Project Page: https://OmniMoGen.github.io/.
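
The abstract names an RVQ-VAE as the motion tokenizer but gives no details of it. As a rough illustration of the general idea behind residual vector quantization (RVQ) only, and not the authors' implementation, the sketch below quantizes a vector in stages, with each stage encoding the residual left by the previous one. The function names (`nearest`, `rvq_encode`, `rvq_decode`), the two toy codebooks, and the 2-D input are all assumptions made for this example; a real RVQ-VAE learns its codebooks jointly with an encoder and decoder.

```python
# Illustrative residual vector quantization (RVQ) in pure Python.
# All codebooks and vectors are toy values, not taken from OmniMoGen.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next stage sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(codebooks, indices):
    """Reconstruct the vector by summing the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical two-stage setup: a coarse codebook followed by a fine one.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
tokens = rvq_encode(codebooks, [1.2, 0.8])
recon = rvq_decode(codebooks, tokens)
```

In this toy setup the input is represented by one discrete index per stage, and those indices are the kind of tokens a transformer could consume alongside text tokens. How OmniMoGen actually interleaves the two is not specified in the abstract.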
