Target speech separation refers to extracting a target speaker's voice from an overlapped audio mixture of simultaneous talkers. Previous work on using the visual modality for target speech separation has demonstrated great potential. This work proposes a general multi-modal framework for target speech separation that utilizes all available information about the target speaker, including his/her spatial location, voice characteristics, and lip movements. Under this framework, we also investigate fusion methods for multi-modal joint modeling. A factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multiple modalities at the embedding level. This method first factorizes the mixture audio into a set of acoustic subspaces, then leverages the target's information from the other modalities to enhance these subspace acoustic embeddings with a learnable attention scheme. To validate the robustness of the proposed multi-modal separation model in practical scenarios, the system was evaluated under conditions in which one of the modalities is temporarily missing, invalid, or corrupted. Experiments are conducted on a large-scale audio-visual dataset collected from YouTube (to be released), spatialized with simulated room impulse responses (RIRs). Experimental results show that our proposed multi-modal framework significantly outperforms single-modal and bi-modal speech separation approaches, while still supporting real-time processing.
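To make the factorized attention-based fusion concrete, the sketch below shows one plausible reading of the idea in PyTorch: the mixture-audio embedding is split into a set of acoustic subspaces, and a query derived from the target-speaker cues (direction, voiceprint, lip embedding) produces a learnable attention weight per subspace. The module name, dimensions (audio_dim, cue_dim, num_subspaces), and the single-query attention form are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch, assuming PyTorch and hypothetical dimensions;
# not the paper's actual architecture.
import torch
import torch.nn as nn


class FactorizedAttentionFusion(nn.Module):
    """Split the mixture-audio embedding into K acoustic subspaces and
    re-weight each subspace with attention driven by the target cues."""

    def __init__(self, audio_dim=512, cue_dim=256, num_subspaces=8):
        super().__init__()
        assert audio_dim % num_subspaces == 0
        self.k = num_subspaces
        self.sub_dim = audio_dim // num_subspaces
        # Project the concatenated target cues into one query per subspace.
        self.query = nn.Linear(cue_dim, num_subspaces * self.sub_dim)
        self.key = nn.Linear(self.sub_dim, self.sub_dim)

    def forward(self, audio_emb, target_cues):
        # audio_emb:   (batch, time, audio_dim) mixture embedding
        # target_cues: (batch, cue_dim) fused direction/voice/lip vector
        b, t, _ = audio_emb.shape
        subspaces = audio_emb.view(b, t, self.k, self.sub_dim)
        q = self.query(target_cues).view(b, 1, self.k, self.sub_dim)
        k = self.key(subspaces)
        # One scalar attention weight per subspace and time step.
        scores = (q * k).sum(-1, keepdim=True) / self.sub_dim ** 0.5
        weights = torch.softmax(scores, dim=2)           # over subspaces
        enhanced = (weights * subspaces).view(b, t, -1)  # re-weighted audio
        return enhanced


if __name__ == "__main__":
    fusion = FactorizedAttentionFusion()
    mix = torch.randn(2, 100, 512)   # dummy mixture embedding
    cues = torch.randn(2, 256)       # dummy target-speaker cues
    print(fusion(mix, cues).shape)   # torch.Size([2, 100, 512])
```

Because the subspace re-weighting depends only on the current frame's embedding and a fixed-size cue vector, a fusion step of this shape is compatible with the streaming, real-time processing claim in the abstract.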