In this paper, we present a novel multi-channel speech extraction system that simultaneously extracts multiple clean individual sources from a mixture in noisy and reverberant environments. The proposed method builds on an improved multi-channel time-domain speech separation network and uses speaker embeddings to identify and extract multiple targets without label permutation ambiguity. To convey speaker information to the extraction model efficiently, we propose a new speaker conditioning mechanism that adds a dedicated speaker branch for receiving external speaker embeddings. Experiments on 2-channel WHAMR! data show that the proposed system improves source separation performance by 9% relative over a strong multi-channel baseline and increases speech recognition accuracy by more than 16% relative over the same baseline.
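As a rough illustration of why conditioning on speaker embeddings removes label permutation ambiguity, the following minimal sketch produces one output per external embedding, so the output order is fixed by the order in which the embeddings are supplied. It is not the network described above: the function names `speaker_branch` and `extract_targets`, the tanh projection, the multiplicative conditioning, and all shapes are assumptions chosen only to keep the example small.

```python
import numpy as np

def speaker_branch(embedding, weight, bias):
    """Hypothetical speaker branch: map an external speaker embedding to a
    per-feature gain vector (an assumption, not the paper's actual layer)."""
    return np.tanh(embedding @ weight + bias)

def extract_targets(mixture_feats, embeddings, weight, bias):
    """Return one masked feature map per speaker embedding.

    mixture_feats: (T, F) encoded mixture frames
    embeddings:    (S, D) one external embedding per target speaker
    Output order follows the embedding order, so no permutation search is needed.
    """
    outputs = []
    for emb in embeddings:
        gain = speaker_branch(emb, weight, bias)       # (F,)
        conditioned = mixture_feats * gain             # broadcast over frames
        mask = 1.0 / (1.0 + np.exp(-conditioned))      # sigmoid mask in (0, 1)
        outputs.append(mask * mixture_feats)
    return np.stack(outputs)                           # (S, T, F)

# Toy data standing in for encoded mixture frames and speaker embeddings.
rng = np.random.default_rng(0)
T, F, D, S = 50, 16, 8, 2
mixture_feats = rng.standard_normal((T, F))
embeddings = rng.standard_normal((S, D))
weight = rng.standard_normal((D, F)) * 0.1
bias = np.zeros(F)

estimates = extract_targets(mixture_feats, embeddings, weight, bias)
# Reversing the embedding order reverses the output order accordingly.
swapped = extract_targets(mixture_feats, embeddings[::-1], weight, bias)
```

In this sketch each estimate is tied to the embedding that produced it, which is the property that lets training skip permutation-invariant label assignment; a real system would replace the toy projection and mask with learned layers.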