We present an audio-visual speech separation learning method that considers the correspondence between the separated signals and the visual signals to reflect speech characteristics during training. Audio-visual speech separation is a technique for estimating the individual speech signals from a mixture using the visual signals of the speakers. Conventional studies on audio-visual speech separation mainly train the separation model with an audio-only loss, which reflects the distance between the source signals and the separated signals. However, such losses do not reflect the characteristics of the speech signals, including the speaker's characteristics and phonetic information, which leads to distortion or residual noise. To address this problem, we propose the cross-modal correspondence (CMC) loss, which is based on the co-occurrence of the speech signal and the visual signal. Since the visual signal is not affected by background noise and contains speaker and phonetic information, using the CMC loss enables the audio-visual speech separation model to remove noise while preserving the speech characteristics. Experimental results demonstrate that the proposed method learns the co-occurrence on the basis of the CMC loss, which improves separation performance.
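The abstract does not give the exact form of the CMC loss, but the idea of scoring the co-occurrence of a separated signal with its speaker's visual stream can be sketched as a similarity term added to a standard separation loss. Below is a minimal illustrative sketch; the embedding encoders, the cosine-similarity formulation, and the weight `lambda_cmc` are assumptions for exposition, not the paper's definition.

```python
# Hypothetical sketch of a cross-modal correspondence (CMC) term combined
# with an audio-only separation loss. Encoders, similarity measure, and the
# weighting are illustrative assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F


def cmc_loss(audio_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
    """Encourage each separated signal's embedding to match the embedding of
    the same speaker's visual stream.

    audio_emb:  (batch, dim) embeddings of the separated speech signals
    visual_emb: (batch, dim) embeddings of the corresponding visual signals
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)
    # Cosine similarity between matching audio-visual pairs; loss is low
    # when a separated signal co-occurs with its own visual stream.
    similarity = (audio_emb * visual_emb).sum(dim=-1)
    return (1.0 - similarity).mean()


def total_loss(separation_loss: torch.Tensor,
               audio_emb: torch.Tensor,
               visual_emb: torch.Tensor,
               lambda_cmc: float = 0.1) -> torch.Tensor:
    """Audio-only separation loss (e.g., a signal-level distance between the
    source and the estimate) plus the weighted CMC term."""
    return separation_loss + lambda_cmc * cmc_loss(audio_emb, visual_emb)
```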