Although audio-visual representation learning has proven applicable to many downstream tasks, the representation of dance videos, which are more specific and typically accompanied by music with complex auditory content, remains challenging and largely uninvestigated. Considering the intrinsic alignment between a dancer's cadenced movements and the music rhythm, we introduce MuDaR, a novel Music-Dance Representation learning framework that synchronizes music and dance rhythms in both explicit and implicit ways. Specifically, inspired by music rhythm analysis, we derive dance rhythms from visual appearance and motion cues. These visual rhythms are then temporally aligned with their music counterparts, which are extracted from the amplitude of sound intensity. Meanwhile, we exploit the implicit coherence between the rhythms of the audio and visual streams through contrastive learning: the model learns a joint embedding by predicting the temporal consistency of audio-visual pairs. The resulting music-dance representation, together with the ability to detect audio and visual rhythms, can further be applied to three downstream tasks: (a) dance classification, (b) music-dance retrieval, and (c) music-dance retargeting. Extensive experiments demonstrate that our proposed framework outperforms other self-supervised methods by a large margin.
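To make the two synchronization ideas concrete, the following is a minimal sketch, not the authors' implementation. It assumes the music rhythm can be approximated by librosa's onset-strength envelope (a stand-in for the paper's sound-intensity amplitude), the dance rhythm by the motion magnitude of 2D pose keypoints (a simplification of the appearance and motion cues), and the implicit objective by a standard InfoNCE contrastive loss over temporally aligned audio-visual pairs.

```python
# Illustrative sketch only; function names and the keypoint input format
# are assumptions, not the MuDaR paper's actual interfaces.
import librosa
import numpy as np
import torch
import torch.nn.functional as F


def music_rhythm(audio_path: str, sr: int = 22050, hop: int = 512) -> np.ndarray:
    """Per-frame music rhythm curve from the onset-strength envelope."""
    y, sr = librosa.load(audio_path, sr=sr)
    env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)
    return env / (env.max() + 1e-8)  # normalize to [0, 1]


def dance_rhythm(keypoints: np.ndarray) -> np.ndarray:
    """Per-frame dance rhythm from keypoint motion magnitude.

    keypoints: (T, J, 2) array of 2D joint positions over T video frames.
    """
    velocity = np.diff(keypoints, axis=0)             # (T-1, J, 2)
    mag = np.linalg.norm(velocity, axis=-1).mean(-1)  # mean joint speed per frame
    return mag / (mag.max() + 1e-8)


def infonce(audio_emb: torch.Tensor, video_emb: torch.Tensor,
            temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss: temporally aligned audio-visual clip pairs
    are positives; all other pairs in the batch serve as negatives."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

The explicit alignment would compare the two normalized rhythm curves (e.g., by matching their peaks on a shared frame grid), while the InfoNCE term trains the joint embedding to capture the implicit audio-visual coherence.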