A speaker extraction algorithm seeks to extract the target speaker's voice from a multi-talker speech mixture. An auxiliary reference, such as a video recording or a pre-recorded speech sample, is usually used as a cue to form top-down auditory attention. Prior studies have focused mostly on speaker extraction from multi-talker speech mixtures with highly overlapping speakers. However, a multi-talker speech mixture is often only sparsely overlapped; furthermore, the target speaker may even be absent at times. In this paper, we propose a universal speaker extraction network that works for all multi-talker scenarios, where the target speaker can be either absent or present. When the target speaker is present, the network performs well over a wide range of target-interference speaker overlapping ratios, from 0% to 100%. Speech in such universal multi-talker scenarios is generally described as sparsely overlapped speech. We advocate that a visual cue, i.e., lip movement, is more informative as the auxiliary reference than an audio cue, i.e., pre-recorded speech. In addition, we propose a scenario-aware differentiated loss function for network training. Experimental results show that our proposed network outperforms various competitive baselines in disentangling sparsely overlapped speech in terms of both signal fidelity and perceptual evaluations.