In this paper, we present a transformer-based learning framework for 3D dance generation conditioned on music. We carefully design our network architecture and empirically study the key factors for obtaining qualitatively pleasing results. The critical components include a deep cross-modal transformer, which effectively learns the correlation between music and dance motion, and a full-attention mechanism with future-N supervision, which is essential for producing long-range, non-freezing motion. In addition, we propose a new dataset of paired 3D motion and music called AIST++, which we reconstruct from the AIST multi-view dance videos. This dataset contains 1.1M frames of 3D dance motion in 1408 sequences, covering 10 genres of dance choreography and accompanied by multi-view camera parameters. To our knowledge, it is the largest dataset of this kind. Extensive experiments on AIST++ demonstrate that our method produces significantly better results than state-of-the-art methods, both qualitatively and quantitatively.
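To make the two components named above concrete, the following is a minimal sketch, not the authors' implementation, of a cross-modal transformer that fuses music and motion tokens with full (non-causal) attention and is supervised to predict the next N motion frames. All module names, feature dimensions, and the `future_n` horizon here are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch (assumed, not the authors' code) of a cross-modal transformer
# with full attention and future-N supervision for music-conditioned dance generation.
import torch
import torch.nn as nn


class CrossModalDanceTransformer(nn.Module):
    def __init__(self, motion_dim=219, music_dim=35, d_model=512,
                 n_heads=8, n_layers=6, future_n=20):
        super().__init__()
        self.future_n = future_n
        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.music_proj = nn.Linear(music_dim, d_model)
        # Full (non-causal) self-attention over the concatenated music + motion tokens,
        # so every output position can attend to the entire conditioning context.
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=2048, batch_first=True)
        self.cross_modal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, motion_dim)

    def forward(self, motion, music):
        # motion: (B, T_motion, motion_dim) seed motion; music: (B, T_music, music_dim)
        tokens = torch.cat([self.music_proj(music), self.motion_proj(motion)], dim=1)
        fused = self.cross_modal(tokens)  # full attention, no causal mask
        # Read out the last `future_n` tokens and supervise them as the next N frames.
        return self.head(fused[:, -self.future_n:, :])


if __name__ == "__main__":
    model = CrossModalDanceTransformer()
    seed_motion = torch.randn(2, 120, 219)  # illustrative seed-motion window
    music_feat = torch.randn(2, 240, 35)    # illustrative music-feature window
    pred = model(seed_motion, music_feat)   # (2, 20, 219): future-N motion prediction
    print(pred.shape)
```

At inference time, a model of this shape would typically be applied autoregressively: the predicted frames are appended to the seed motion and the window is shifted forward, while training supervises all N future frames at once, which is the intent of the future-N supervision described above.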