A speaker extraction algorithm seeks to extract the target speaker's speech from a multi-talker speech mixture. Prior studies focus mostly on speaker extraction from highly overlapped multi-talker speech mixtures. However, in natural speech communication, the target-interference speaker overlapping ratio can vary over a wide range, from 0% to 100%; furthermore, the target speaker may be absent from the mixture altogether. We refer to speech mixtures in such universal multi-talker scenarios as general speech mixtures. A speaker extraction algorithm requires an auxiliary reference, such as a video recording or a pre-recorded speech sample, to form top-down auditory attention on the target speaker. We argue that a visual cue, i.e., lip movement, is more informative than an audio cue, i.e., pre-recorded speech, as the auxiliary reference for disentangling the target speaker from a general speech mixture. In this paper, we propose a universal speaker extraction network with a visual cue that works in all multi-talker scenarios. In addition, we propose a scenario-aware differentiated loss function for network training to balance performance across different target-interference speaker pairing scenarios. Experimental results show that our proposed method outperforms various competitive baselines on general speech mixtures in terms of signal fidelity.
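The abstract does not spell out the scenario-aware differentiated loss, but its intent can be illustrated with a minimal sketch. The split below, assuming a PyTorch setup, applies a scale-invariant SDR term where the target speaker is present and an energy penalty toward silence where the target is absent; this particular decomposition and the weighting factor `alpha` are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR between estimate and reference, per utterance."""
    ref_energy = torch.sum(ref ** 2, dim=-1, keepdim=True) + eps
    proj = torch.sum(est * ref, dim=-1, keepdim=True) * ref / ref_energy
    noise = est - proj
    ratio = (torch.sum(proj ** 2, dim=-1) + eps) / (torch.sum(noise ** 2, dim=-1) + eps)
    return -10.0 * torch.log10(ratio)

def scenario_aware_loss(est, ref, target_present, alpha=1.0):
    """est, ref: (batch, time) waveforms; target_present: (batch,) boolean mask.

    Hypothetical differentiated loss: signal fidelity (SI-SDR) when the target
    speaks, an energy term pushing the output toward silence when it does not.
    """
    present = target_present.bool()
    loss = torch.zeros(est.shape[0], device=est.device)
    if present.any():
        loss[present] = si_sdr_loss(est[present], ref[present])
    if (~present).any():
        # Minimizing the log energy of the estimate drives it toward silence;
        # the eps bounds the term from below.
        absent_energy = torch.mean(est[~present] ** 2, dim=-1)
        loss[~present] = alpha * 10.0 * torch.log10(absent_energy + 1e-8)
    return loss.mean()
```

Under this reading, the weight `alpha` is what balances the target-present and target-absent scenarios during training; a finer-grained variant could apply the mask per time segment rather than per utterance, so that partially overlapped mixtures contribute both terms.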