Target speech extraction (TSE) systems are designed to extract the speech of a target speaker from a multi-talker mixture. The training objective of most prior TSE networks is to improve the reconstruction quality of the extracted speech waveform. However, it has been reported that a TSE system that delivers high reconstruction performance may still give a poor user experience in practice. One such problem is extraction of the wrong speaker (called speaker confusion, SC), which leads to a strongly negative experience and hampers effective conversation. To mitigate the pressing SC issue, we reformulate the training objective and propose two novel loss schemes that use a reconstruction-improvement metric defined at the level of small chunks and leverage the distribution information associated with that metric. Both loss schemes encourage a TSE network to pay attention to SC chunks, identified from this distribution information. On this basis, we present X-SepFormer, an end-to-end TSE model that combines the proposed loss schemes with a SepFormer backbone. Experimental results on the benchmark WSJ0-2mix dataset validate the effectiveness of our proposals, showing consistent improvement in SC errors (14.8% relative). Moreover, with an SI-SDRi of 19.4 dB and a PESQ of 3.81, our best system significantly outperforms the current SOTA systems and offers the best TSE results reported to date on WSJ0-2mix.