Video generation models have advanced significantly, yet they still struggle to synthesize complex human movements because of the high degrees of freedom in human articulation. This limitation stems from pixel-only training objectives, which bias models toward appearance fidelity at the expense of learning the underlying kinematics. To address this, we introduce EchoMotion, a framework that models the joint distribution of appearance and human motion, thereby improving the quality of complex human-action video generation. EchoMotion extends the DiT (Diffusion Transformer) framework with a dual-branch architecture that jointly processes concatenated tokens from the two modalities. Furthermore, we propose MVS-RoPE (Motion-Video Synchronized RoPE), which provides a unified 3D positional encoding for both video and motion tokens. By giving the dual-modal latent sequence a synchronized coordinate system, MVS-RoPE establishes an inductive bias that fosters temporal alignment between the two modalities. We also propose a Motion-Video Two-Stage Training Strategy, which enables the model to perform both the joint generation of complex human-action videos with their corresponding motion sequences and versatile cross-modal conditional generation tasks. To train a model with these capabilities, we construct HuMoVe, a large-scale dataset of approximately 80,000 high-quality, human-centric video-motion pairs. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.
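The abstract leaves the exact construction of MVS-RoPE unspecified; the sketch below illustrates one plausible reading, in which the rotary head dimension is split evenly across a shared (t, h, w) coordinate system, video tokens take their spatiotemporal grid positions, and motion tokens reuse the same temporal index t with spatial coordinates placed outside the video grid. All names (`rope_1d`, `apply_rope`, `mvs_rope`) and the three-way dimension split are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of an MVS-RoPE-style synchronized 3D rotary encoding.
# Assumption: video and motion tokens share one (t, h, w) coordinate
# system; motion tokens inherit the frame index t so that tokens from
# the two modalities at the same timestep share the same temporal phase.
import torch

def rope_1d(pos: torch.Tensor, dim: int, base: float = 10000.0):
    """cos/sin tables for one axis; pos: (N,), dim must be even."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = pos.float()[:, None] * freqs[None, :]        # (N, dim/2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    """Rotate consecutive feature pairs of x (N, dim) by given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def mvs_rope(q: torch.Tensor, coords: torch.Tensor):
    """q: (N, dim) queries or keys; coords: (N, 3) integer (t, h, w).
    The head dim is split into three equal even slices, one per axis,
    so any two tokens with equal t get identical temporal rotation."""
    _, dim = q.shape
    d = dim // 3 - (dim // 3) % 2          # even slice per axis
    out = q.clone()                        # leftover dims stay unrotated
    for axis in range(3):
        cos, sin = rope_1d(coords[:, axis], d)
        s = axis * d
        out[:, s:s + d] = apply_rope(q[:, s:s + d], cos, sin)
    return out

# Toy usage: 4 video tokens over 2 frames (2x1 spatial grid) plus 2
# motion tokens aligned on t; their spatial coordinate h=2 lies outside
# the video grid so no two tokens collide (one choice among many).
video_coords = torch.tensor([[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 1, 0]])
motion_coords = torch.tensor([[0, 2, 0], [1, 2, 0]])
coords = torch.cat([video_coords, motion_coords])
q = torch.randn(6, 24)
print(mvs_rope(q, coords).shape)  # torch.Size([6, 24])
```

Under this reading, the attention logit between a video token and a motion token depends only on their relative frame offset along the shared t axis, which is one concrete way a synchronized coordinate system can encourage temporal alignment between the two modalities.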