3D human motion generation is pivotal across film, animation, gaming, and embodied intelligence. Traditional 3D motion synthesis relies on costly motion capture, while recent work shows that 2D videos provide rich, temporally coherent observations of human behavior. Existing approaches, however, either map high-level text descriptions to motion or rely solely on video conditioning, leaving a gap between generated dynamics and real-world motion statistics. We introduce MotionDuet, a multimodal framework that aligns motion generation with the distribution of video-derived representations. In this dual-conditioning paradigm, video cues extracted from a pretrained model (e.g., VideoMAE) ground low-level motion dynamics, while textual prompts provide semantic intent. To bridge the distribution gap across modalities, we propose Dual-stream Unified Encoding and Transformation (DUET) and a Distribution-Aware Structural Harmonization (DASH) loss. DUET fuses video-informed cues into the motion latent space via unified encoding and dynamic attention, while DASH aligns motion trajectories with both the distributional and structural statistics of video features. An auto-guidance mechanism further balances textual and visual signals by leveraging a weakened copy of the model, enhancing controllability without sacrificing diversity. Extensive experiments demonstrate that MotionDuet generates realistic and controllable human motions, surpassing state-of-the-art baselines.
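The abstract does not detail the DASH loss or the auto-guidance rule; as a rough illustration only, the minimal PyTorch-style sketch below shows one plausible reading of each: a distribution-alignment term that matches the mean and covariance of motion latents to video-derived features, and a guidance step that extrapolates the full model's prediction away from a weakened copy of itself. The function names, tensor shapes, guidance signature, and the choice of mean/covariance statistics are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def dash_style_alignment_loss(motion_latents, video_features):
    """Assumed sketch of a distribution-alignment term: match first- and
    second-order statistics (mean and covariance) of motion latents to
    those of features from a pretrained video encoder.

    motion_latents: (B, T, D) latent motion trajectories
    video_features: (B, T, D) video-derived features (e.g., VideoMAE)
    """
    # Flatten batch and time so statistics are computed per feature dim.
    m = motion_latents.reshape(-1, motion_latents.shape[-1])
    v = video_features.reshape(-1, video_features.shape[-1])

    # First-order (mean) alignment.
    mean_loss = F.mse_loss(m.mean(dim=0), v.mean(dim=0))

    # Second-order (covariance) alignment.
    m_c = m - m.mean(dim=0, keepdim=True)
    v_c = v - v.mean(dim=0, keepdim=True)
    cov_m = m_c.t() @ m_c / (m.shape[0] - 1)
    cov_v = v_c.t() @ v_c / (v.shape[0] - 1)
    cov_loss = F.mse_loss(cov_m, cov_v)

    return mean_loss + cov_loss

def auto_guided_prediction(model, weak_model, x_t, t, text, video, w=1.5):
    """Assumed sketch of auto-guidance: steer the denoising prediction by
    extrapolating away from a weakened copy of the model. The conditioning
    interface (text, video) and guidance weight w are hypothetical."""
    pred_strong = model(x_t, t, text, video)
    pred_weak = weak_model(x_t, t, text, video)
    return pred_weak + w * (pred_strong - pred_weak)
```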