VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Whereas existing methods focus on learning the alignment between the speaker's lip movements and the sounds they generate, we propose to leverage the speaker's face appearance as an additional prior to isolate the corresponding vocal qualities they are likely to produce. Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video. It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement, and generalizes well to challenging real-world videos of diverse scenarios. Our video results and code: http://vision.cs.utexas.edu/projects/VisualVoice/.
