Target speaker extraction aims to extract the target speaker's voice from a mixture of signals, given an enrollment utterance from that speaker. This enrollment utterance is also called the anchor speech. Effective use of the anchor speech is crucial for speaker extraction. In this study, we propose a new system that fully exploits the speaker information in the anchor speech. Unlike models that use only local or only global features of the anchor, the proposed method extracts speaker information at both the global and local levels and feeds these features into a speech separation network. Our approach benefits from the complementary advantages of global and local features, which improves speaker extraction performance. We verified the feasibility of this local-global representation (LGR) method with multiple speaker extraction models. Systematic experiments on the open-source dataset Libri-2talker showed that the proposed method significantly outperformed the baseline models.
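To make the idea of combining global and local anchor features concrete, the following is a minimal sketch and not the implementation evaluated in this study. The abstract does not specify how either feature type is computed, so everything here is an assumption for illustration: mean pooling over time stands in for the global feature, scaled dot-product attention between mixture frames and anchor frames stands in for the local features, and the function names (`global_feature`, `local_features`, `fuse_local_global`) are hypothetical.

```python
import numpy as np


def global_feature(anchor):
    """Assumed global feature: one utterance-level vector (mean over time)."""
    return anchor.mean(axis=0)  # shape (D,)


def local_features(mix, anchor):
    """Assumed local features: each mixture frame attends over anchor frames."""
    d = anchor.shape[1]
    scores = mix @ anchor.T / np.sqrt(d)          # (T_mix, T_anchor)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # softmax over anchor frames
    return weights @ anchor                       # (T_mix, D)


def fuse_local_global(mix, anchor):
    """Concatenate mixture frames with global and local anchor features.

    mix:    (T_mix, D) frame-level features of the mixture
    anchor: (T_anchor, D) frame-level features of the anchor speech
    Returns (T_mix, 3 * D), a possible input to a separation network.
    """
    g = np.broadcast_to(global_feature(anchor), mix.shape)  # repeat per frame
    l = local_features(mix, anchor)
    return np.concatenate([mix, g, l], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mix = rng.standard_normal((50, 16))     # 50 mixture frames, 16-dim
    anchor = rng.standard_normal((30, 16))  # 30 anchor frames, 16-dim
    print(fuse_local_global(mix, anchor).shape)
```

In this sketch the global vector is identical for every mixture frame, while the local features vary frame by frame, which is one way the two could carry complementary information. A real system would learn these encoders jointly with the separation network.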