Dance-to-music (D2M) generation aims to automatically compose music that is rhythmically and temporally aligned with dance movements. Existing methods typically rely on coarse rhythm embeddings, such as global motion features or binarized joint-based rhythm values, which discard fine-grained motion cues and result in weak rhythmic alignment. Moreover, temporal mismatches introduced by feature downsampling further hinder precise synchronization between dance and music. To address these problems, we propose \textbf{GACA-DiT}, a diffusion transformer-based framework with two novel modules for rhythmically consistent and temporally aligned music generation. First, a \textbf{genre-adaptive rhythm extraction} module combines multi-scale temporal wavelet analysis and spatial phase histograms with adaptive joint weighting to capture fine-grained, genre-specific rhythm patterns. Second, a \textbf{context-aware temporal alignment} module resolves temporal mismatches using learnable context queries to align music latents with relevant dance rhythm features. Extensive experiments on the AIST++ and TikTok datasets demonstrate that GACA-DiT outperforms state-of-the-art methods in both objective metrics and human evaluation. Project page: https://beria-moon.github.io/GACA-DiT/.
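The abstract does not spell out implementation details, so the following is a minimal Python sketch, for intuition only, of what the genre-adaptive rhythm extraction could look like: per-joint multi-scale wavelet band energies pooled with genre-conditioned joint weights. The function and class names (`wavelet_rhythm_features`, `AdaptiveJointWeighting`), the tensor shapes, and the choice of PyWavelets/PyTorch are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of multi-scale temporal wavelet rhythm extraction
# with adaptive joint weighting. Names and shapes are assumptions for
# illustration, not the GACA-DiT implementation.
import numpy as np
import pywt
import torch
import torch.nn as nn

def wavelet_rhythm_features(joint_vel: np.ndarray, wavelet: str = "db4",
                            levels: int = 3) -> np.ndarray:
    """Per-joint multi-scale band energies from a discrete wavelet transform.

    joint_vel: (T, J) joint velocity magnitudes over T frames and J joints.
    Returns: (J, levels + 1) energy features (one approximation + `levels` detail bands).
    """
    T, J = joint_vel.shape
    feats = np.zeros((J, levels + 1))
    for j in range(J):
        # coeffs = [cA_L, cD_L, ..., cD_1]; summarize each band by mean energy.
        coeffs = pywt.wavedec(joint_vel[:, j], wavelet, level=levels)
        feats[j] = [np.mean(c ** 2) for c in coeffs]
    return feats

class AdaptiveJointWeighting(nn.Module):
    """Pool per-joint features with genre-conditioned softmax weights."""
    def __init__(self, feat_dim: int, genre_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim + genre_dim, 1)

    def forward(self, joint_feats: torch.Tensor, genre_emb: torch.Tensor) -> torch.Tensor:
        # joint_feats: (B, J, F); genre_emb: (B, G)
        g = genre_emb.unsqueeze(1).expand(-1, joint_feats.size(1), -1)
        w = torch.softmax(self.score(torch.cat([joint_feats, g], dim=-1)), dim=1)
        return (w * joint_feats).sum(dim=1)  # (B, F) genre-weighted rhythm feature
```

A softmax over joints, conditioned on a genre embedding, is one plausible way to realize "genre-specific" weighting: it lets the pooling emphasize, say, footwork joints for one genre and arm joints for another.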
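Similarly, the context-aware temporal alignment module can be pictured as cross-attention in which learnable queries, one per music-latent timestep, attend over the dance rhythm features at their native frame rate, rather than hard-downsampling the dance sequence to the latent grid. The sketch below is an assumption-laden illustration; `ContextAwareTemporalAlignment`, the dimensions, and the injection point into the DiT are all hypothetical.

```python
# Minimal sketch (assumption, not the authors' code) of aligning music
# latents with dance rhythm features via learnable context queries.
import torch
import torch.nn as nn

class ContextAwareTemporalAlignment(nn.Module):
    def __init__(self, d_model: int, n_music_tokens: int, n_heads: int = 8):
        super().__init__()
        # One learnable context query per music-latent timestep.
        self.queries = nn.Parameter(torch.randn(n_music_tokens, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, rhythm_feats: torch.Tensor) -> torch.Tensor:
        # rhythm_feats: (B, T_dance, d_model) at the motion frame rate.
        q = self.queries.unsqueeze(0).expand(rhythm_feats.size(0), -1, -1)
        # Queries pull rhythm context onto the (coarser) music-latent grid,
        # sidestepping the temporal mismatch from feature downsampling.
        ctx, _ = self.attn(q, rhythm_feats, rhythm_feats)
        return self.norm(ctx)  # (B, n_music_tokens, d_model)

# Illustrative usage: the resulting context could then condition the
# diffusion transformer, e.g. through cross-attention in each DiT block.
aligner = ContextAwareTemporalAlignment(d_model=512, n_music_tokens=256)
ctx = aligner(torch.randn(2, 1200, 512))  # 1200 dance frames -> 256 music tokens
```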