In this work, we propose Exformer, a time-domain architecture for target speaker extraction. It consists of a pre-trained speaker embedder network and a separator network built from Transformer encoder blocks. We study multiple methods for combining the speaker information with the input mixture, and the resulting Exformer architecture achieves better extraction performance than prior time-domain networks. Furthermore, we investigate a two-stage training procedure: starting from a pre-trained supervised model, we continue training on mixtures that have no reference signals. Experimental results show that the proposed semi-supervised learning procedure improves on the performance of the supervised baselines.
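To make the idea of combining speaker information with the input mixture concrete, the following is a minimal sketch of two fusion strategies that are common in speaker-conditioned separation: concatenating the speaker embedding to every frame, and scaling every frame element-wise by the embedding. The abstract does not say which methods were studied, so both strategies, the function names `fuse_concat` and `fuse_multiply`, and the toy values are illustrative assumptions, not the Exformer implementation.

```python
from typing import List

Frame = List[float]


def fuse_concat(mixture_frames: List[Frame], speaker_emb: Frame) -> List[Frame]:
    """Append the same speaker embedding to every time frame.

    Hypothetical helper: the feature dimension grows from D to D + E.
    """
    return [frame + speaker_emb for frame in mixture_frames]


def fuse_multiply(mixture_frames: List[Frame], speaker_emb: Frame) -> List[Frame]:
    """Scale every frame element-wise by the speaker embedding.

    Hypothetical helper: requires the embedding and frame dimensions to match.
    """
    for frame in mixture_frames:
        if len(frame) != len(speaker_emb):
            raise ValueError("frame and speaker embedding dimensions differ")
    return [[x * s for x, s in zip(frame, speaker_emb)] for frame in mixture_frames]


# Toy data (assumed values): two frames of encoded mixture features, D = 3.
mixture = [[0.5, 1.0, 1.5], [2.0, 2.5, 3.0]]
# Toy speaker embedding, E = 3, as a pre-trained embedder might produce.
speaker = [1.0, 0.0, 2.0]

concat = fuse_concat(mixture, speaker)    # each frame now has 6 features
scaled = fuse_multiply(mixture, speaker)  # each frame keeps 3 features
```

In either case the fused frames would then be passed to the separator network. Concatenation leaves the mixture features untouched but widens the input, while multiplicative fusion keeps the dimension fixed and lets the embedding gate individual features.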