Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from knowing the location or direction of the target speaker. However, these studies assume that the target speaker's location is known in advance or detected from an extra visual cue, e.g., a face image or video. In this paper, we propose an end-to-end localized target speaker extraction framework based purely on speech cues, named L-SpEx. Specifically, we design a speaker localizer, driven by the target speaker's embedding, to extract spatial features, including the direction-of-arrival (DOA) of the target speaker and the beamforming output. The spatial cues and the target speaker's embedding are then jointly used to form top-down auditory attention to the target speaker. Experiments on the multi-channel reverberant dataset MC-Libri2Mix show that our L-SpEx approach significantly outperforms the baseline system.
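To make the spatial cues concrete, the sketch below illustrates the kind of feature the abstract attributes to the speaker localizer: scanning candidate DOAs with a delay-and-sum beamformer and picking the direction of maximum output power. This is a generic toy example with a simulated two-microphone array and white-noise source, not the paper's actual localizer; all constants (mic spacing, sample rate) are assumptions for illustration.

```python
# Toy sketch (not the L-SpEx implementation): delay-and-sum beamforming
# over candidate DOAs, the kind of spatial cue the speaker localizer
# is described as producing.
import numpy as np

C = 343.0    # speed of sound (m/s)
FS = 16000   # sample rate (Hz), assumed
D = 0.05     # mic spacing (m), assumed 2-mic linear array

def frac_delay(x, tau):
    """Delay signal x by tau seconds (fractional, via FFT phase shift)."""
    f = np.fft.fftfreq(len(x), d=1.0 / FS)
    return np.real(np.fft.ifft(np.fft.fft(x) * np.exp(-2j * np.pi * f * tau)))

def doa_to_tau(deg):
    """Inter-mic delay for a far-field source arriving at `deg` degrees."""
    return D * np.sin(np.deg2rad(deg)) / C

def das_power(x1, x2, deg):
    """Output power of a delay-and-sum beamformer steered to `deg`."""
    y = x1 + frac_delay(x2, -doa_to_tau(deg))  # advance mic 2 to realign
    return float(np.mean(y ** 2))

# Simulate a broadband source arriving from 60 degrees.
rng = np.random.default_rng(0)
src = rng.standard_normal(FS)
x1, x2 = src, frac_delay(src, doa_to_tau(60.0))

# Scan candidate DOAs; the power peak indicates the source direction,
# and the beamformer output at that angle is the enhanced signal.
angles = np.arange(-90, 91)
powers = [das_power(x1, x2, a) for a in angles]
est_doa = int(angles[int(np.argmax(powers))])
```

In L-SpEx the scan is driven by the target speaker's embedding so that, with multiple talkers present, the localizer resolves the target's direction rather than simply the loudest source.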