Informed speaker extraction aims to extract a target speech signal from a mixture of sources given prior knowledge about the desired speaker. Recent deep learning-based methods leverage a speaker discriminative model that maps a reference snippet uttered by the target speaker into a single embedding vector encapsulating the characteristics of that speaker. However, such modeling deliberately neglects the time-varying properties of the reference signal. In this work, we assume that a reference signal is available that is temporally correlated with the target signal. To exploit this correlation, we propose a time-varying source discriminative model that captures the temporal dynamics of the reference signal. We further show that both existing methods and the proposed method generalize to non-speech sources. Experimental results demonstrate that the proposed method significantly improves extraction performance in an acoustic echo reduction scenario.
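The contrast between the conventional single-vector conditioning and the time-varying conditioning described above can be sketched as follows. This is a minimal illustration, not the paper's actual model: the encoder is a random linear projection standing in for a learned discriminative network, and all dimensions and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reference-signal features: T time frames, F feature dims.
T, F, D = 50, 40, 16
ref_feats = rng.standard_normal((T, F))

# Random linear projection standing in for a learned encoder.
W = rng.standard_normal((F, D)) / np.sqrt(F)

# Frame-wise embeddings: one D-dimensional vector per time frame.
frame_emb = ref_feats @ W              # shape (T, D)

# Conventional time-invariant speaker embedding: pool over time,
# discarding the temporal dynamics of the reference signal.
static_emb = frame_emb.mean(axis=0)    # shape (D,)

# Time-varying conditioning: keep the full embedding sequence so a
# temporal correlation with the target signal can be exploited.
dynamic_emb = frame_emb                # shape (T, D)

print(static_emb.shape, dynamic_emb.shape)
```

The key difference is only the pooling step: averaging over the time axis collapses the sequence into one vector, while omitting it preserves a per-frame embedding trajectory that the extraction network can align with the target signal.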