Large language models (LLMs) have unified diverse linguistic tasks within a single framework, yet such unification remains unexplored in human motion generation. Existing methods are confined to isolated tasks, limiting their flexibility for free-form, omni-objective generation. To address this, we propose OmniMoGen, a unified framework that enables versatile motion generation through interleaved text-motion instructions. Built upon a concise RVQ-VAE and transformer architecture, OmniMoGen supports end-to-end instruction-driven motion generation. We construct X2Mo, a large-scale dataset of over 137K interleaved text-motion instructions, and introduce AnyContext, a benchmark for evaluating interleaved motion generation. Experiments show that OmniMoGen achieves state-of-the-art performance on text-to-motion, motion editing, and AnyContext, and exhibits emergent capabilities such as compositional editing, self-reflective generation, and knowledge-informed generation. These results mark a step toward the next generation of intelligent motion generation. Project Page: https://OmniMoGen.github.io/.
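Since the abstract describes the motion tokenizer only as a "concise RVQ-VAE," the sketch below illustrates a generic residual vector quantizer in PyTorch for readers unfamiliar with the idea. It is a minimal illustration under assumed settings, not the authors' implementation: the class name `ResidualVQ` and the layer count, codebook size, and latent dimension are all hypothetical placeholders.

```python
import torch
import torch.nn as nn


class ResidualVQ(nn.Module):
    """Generic residual vector quantizer: each codebook quantizes the residual
    left by the previous one, so the summed codes approximate the latent."""

    def __init__(self, num_layers: int = 4, codebook_size: int = 512, dim: int = 256):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_layers)
        )

    def forward(self, z: torch.Tensor):
        # z: (batch, seq_len, dim) continuous motion latents from an encoder
        residual = z
        quantized = torch.zeros_like(z)
        indices = []
        for codebook in self.codebooks:
            # nearest codebook entry for the current residual
            flat = residual.reshape(-1, residual.shape[-1])          # (B*T, dim)
            idx = torch.cdist(flat, codebook.weight).argmin(dim=-1)  # (B*T,)
            idx = idx.reshape(residual.shape[:-1])                   # (B, T)
            selected = codebook(idx)                                 # (B, T, dim)
            quantized = quantized + selected
            residual = residual - selected
            indices.append(idx)
        # straight-through estimator so gradients reach the encoder
        quantized = z + (quantized - z).detach()
        return quantized, torch.stack(indices, dim=-1)  # codes: (B, T, num_layers)
```

In a setup like this, the discrete code indices would serve as motion tokens that a transformer can interleave with text tokens, and training would typically add a commitment loss between the encoder output and the quantized latents; these details are assumptions about a standard RVQ-VAE pipeline rather than statements about OmniMoGen's specific design.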