We propose a novel task of generating 3D dance movements that simultaneously incorporate both text and music modalities. Unlike existing works that generate dance movements from a single modality such as music, our goal is to produce richer dance movements guided by the instructive information provided by text. However, the lack of paired motion data covering both music and text modalities limits the ability to generate dance movements that integrate both. To alleviate this challenge, we propose to utilize a 3D human motion VQ-VAE to project the motions of the two datasets into a latent space consisting of quantized vectors, which effectively mixes the motion tokens from the two datasets with different distributions for training. Additionally, we propose a cross-modal transformer to integrate text instructions into the motion generation architecture, producing 3D dance movements without degrading the performance of music-conditioned dance generation. To better evaluate the quality of the generated motion, we introduce two novel metrics, namely Motion Prediction Distance (MPD) and Freezing Score, which measure the coherence and the freezing percentage of the generated motion, respectively. Extensive experiments show that our approach can generate realistic and coherent dance movements conditioned on both text and music while maintaining performance comparable to that of either single modality alone. Code will be available at: https://garfield-kh.github.io/TM2D/.
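To make the token-mixing idea concrete, below is a minimal sketch of the vector-quantization step in a motion VQ-VAE: encoder features are snapped to their nearest codebook entries, so motions from both datasets are expressed in one shared discrete vocabulary. This is an illustrative sketch rather than the released implementation; the module name `MotionQuantizer` and hyperparameters such as `codebook_size` and `latent_dim` are assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of the quantization step
# in a motion VQ-VAE: a shared codebook turns continuous motion latents
# from either dataset into discrete motion tokens.
import torch
import torch.nn as nn

class MotionQuantizer(nn.Module):
    def __init__(self, codebook_size: int = 1024, latent_dim: int = 256):
        super().__init__()
        # Shared codebook of quantized vectors; music-paired and text-paired
        # motions are encoded against the same entries.
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, time, latent_dim) encoder output for a motion sequence.
        # Euclidean distance from each latent to every codebook vector.
        dist = torch.cdist(
            z, self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        )
        tokens = dist.argmin(dim=-1)    # (batch, time) discrete motion tokens
        z_q = self.codebook(tokens)     # quantized latents
        # Straight-through estimator so gradients flow back to the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, tokens
```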
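The notion of a freezing percentage can likewise be illustrated with a simple velocity-threshold check over the generated joint trajectories. This is only one plausible reading of the metric, not the paper's exact Freezing Score definition, and the `threshold` value is an assumption.

```python
# Hedged sketch of a freezing-style metric: the fraction of frames whose
# mean joint speed falls below a small threshold. The exact Freezing Score
# in the paper may differ; `threshold` here is an illustrative assumption.
import torch

def freezing_percentage(motion: torch.Tensor, threshold: float = 1e-3) -> float:
    # motion: (time, joints, 3) sequence of 3D joint positions.
    velocity = motion[1:] - motion[:-1]         # per-frame displacement
    speed = velocity.norm(dim=-1).mean(dim=-1)  # mean joint speed per frame
    return (speed < threshold).float().mean().item()  # share of near-static frames
```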