Music-to-3D dance generation aims to synthesize realistic, rhythmically synchronized human dance from music. Existing methods often rely on additional genre labels to improve dance generation, but such labels are typically noisy, coarse-grained, unavailable, or insufficient to capture the diversity of real-world music, which can result in rhythm misalignment or stylistic drift. In contrast, we observe that tempo, a core property reflecting musical rhythm and pace, remains relatively consistent across datasets and genres, typically ranging from 60 to 200 BPM. Based on this observation, we propose TempoMoE, a hierarchical tempo-aware Mixture-of-Experts module that enhances the rhythm perception of the diffusion model. TempoMoE organizes motion experts into tempo-structured groups covering different tempo ranges, with multi-scale beat experts capturing fine-grained and long-range rhythmic dynamics. A Hierarchical Rhythm-Adaptive Routing mechanism dynamically selects and fuses experts conditioned on music features, enabling flexible, rhythm-aligned generation without manual genre labels. Extensive experiments demonstrate that TempoMoE achieves state-of-the-art results in dance quality and rhythm alignment.
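To make the two-level design concrete, below is a minimal PyTorch sketch of a hierarchical tempo-aware MoE layer in the spirit of the description above. The class name `TempoMoELayer`, the number of tempo groups and experts per group, the convolutional beat experts, and the softmax-based routing are all illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a hierarchical tempo-aware MoE layer (illustrative only;
# module names, shapes, and group/expert counts are assumptions, not the
# paper's configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TempoMoELayer(nn.Module):
    def __init__(self, dim=256, num_tempo_groups=4, experts_per_group=2,
                 beat_scales=(1, 4)):
        super().__init__()
        # Motion experts organized into tempo-structured groups, each group
        # nominally covering one tempo range (e.g. 60-95 BPM, 95-130 BPM, ...).
        self.groups = nn.ModuleList([
            nn.ModuleList([
                nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(),
                              nn.Linear(dim * 2, dim))
                for _ in range(experts_per_group)])
            for _ in range(num_tempo_groups)])
        # Multi-scale beat experts: temporal convolutions with small vs. large
        # receptive fields for fine-grained vs. long-range rhythmic dynamics.
        self.beat_experts = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=2 * s + 1, padding=s)
            for s in beat_scales])
        # Hierarchical rhythm-adaptive routing: a group-level router over
        # tempo ranges, an expert-level router inside each group, and a
        # router over the beat-expert scales.
        self.group_router = nn.Linear(dim, num_tempo_groups)
        self.expert_router = nn.Linear(dim, num_tempo_groups * experts_per_group)
        self.scale_router = nn.Linear(dim, len(beat_scales))

    def forward(self, x, music):
        # x: (B, T, dim) motion tokens; music: (B, T, dim) music features.
        ctx = music.mean(dim=1)                      # (B, dim) global music summary
        g_w = F.softmax(self.group_router(ctx), -1)  # weights over tempo groups
        e_w = F.softmax(self.expert_router(ctx)
                        .view(ctx.size(0), len(self.groups), -1), -1)
        out = torch.zeros_like(x)
        for gi, group in enumerate(self.groups):
            for ei, expert in enumerate(group):
                w = (g_w[:, gi] * e_w[:, gi, ei]).view(-1, 1, 1)
                out = out + w * expert(x)
        # Fuse multi-scale beat experts with their own routing weights.
        s_w = F.softmax(self.scale_router(ctx), -1)
        xc = x.transpose(1, 2)                       # (B, dim, T) for Conv1d
        for si, beat in enumerate(self.beat_experts):
            out = out + s_w[:, si].view(-1, 1, 1) * beat(xc).transpose(1, 2)
        return out
```

The sketch only illustrates the select-and-fuse structure: a group-level router picks among tempo-range groups, an expert-level router weights experts within each group, and the beat experts are fused at multiple temporal scales. In the actual method the routing would presumably be driven by rhythm cues extracted from the music rather than a simple mean-pooled feature.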