Recently, target speech separation and extraction techniques for meeting scenarios have become a popular research topic. We propose a speaker diarization aware multiple target speech separation system (SD-MTSS) that extracts the voice of each speaker from the mixed speech simultaneously, rather than requiring a succession of independent processes as in previous solutions. SD-MTSS consists of a speaker diarization (SD) module and a multiple target speech separation (MTSS) module. The former infers the target speaker voice activity detection (TSVAD) states of the mixture and obtains single-talker audio segments of different speakers as reference speech. The latter takes both the mixed audio and the reference speech as inputs and generates an estimated mask. By exploiting the TSVAD decisions and the estimated mask, our SD-MTSS model can extract the speech of each speaker concurrently from a conversational recording without requiring additional enrollment audio in advance. Experimental results show that our MTSS model outperforms the baselines by a large margin, achieving 1.38 dB SDR, 1.34 dB SI-SNR, and 0.13 PESQ improvements over the state-of-the-art SpEx+ baseline on the WSJ0-2mix-extr dataset, respectively. The SD-MTSS system also achieves a significant improvement over the baseline on the AliMeeting dataset.
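The following is a minimal sketch of the two-stage inference flow described above, assuming an SD stage that yields frame-level TSVAD states plus one single-talker reference segment per speaker, followed by a mask-based MTSS stage conditioned on that reference. All function names, shapes, and the toy decision logic are illustrative placeholders, not the authors' actual models or API.

    import numpy as np

    def sd_module(mixture: np.ndarray, num_speakers: int, frame_len: int):
        """Placeholder SD stage: returns TSVAD states (frames x speakers) and one
        single-talker reference segment per speaker (a real system uses a TSVAD model)."""
        num_frames = len(mixture) // frame_len
        tsvad = np.zeros((num_frames, num_speakers), dtype=bool)
        for f in range(num_frames):
            tsvad[f, f % num_speakers] = True  # toy alternating activity decision
        refs = []
        for spk in range(num_speakers):
            f0 = np.where(tsvad[:, spk])[0][0]  # first frame where only this speaker is active
            refs.append(mixture[f0 * frame_len:(f0 + 1) * frame_len].copy())
        return tsvad, refs

    def mtss_module(mixture: np.ndarray, reference: np.ndarray) -> np.ndarray:
        """Placeholder MTSS stage: returns a soft mask over the mixture samples.
        A real system conditions a separation network on the reference speech."""
        return np.full_like(mixture, 0.5)

    def sd_mtss_extract(mixture: np.ndarray, num_speakers: int = 2, frame_len: int = 160):
        """Extract every speaker concurrently: gate the masked mixture by TSVAD decisions."""
        tsvad, refs = sd_module(mixture, num_speakers, frame_len)
        outputs = []
        for spk in range(num_speakers):
            mask = mtss_module(mixture, refs[spk])
            # Expand frame-level TSVAD decisions to sample level, then apply both gates.
            gate = np.repeat(tsvad[:, spk], frame_len).astype(float)
            gate = np.pad(gate, (0, len(mixture) - len(gate)))
            outputs.append(mixture * mask * gate)
        return outputs

    if __name__ == "__main__":
        mix = np.random.randn(16000).astype(np.float32)  # 1 s of audio at 16 kHz
        estimates = sd_mtss_extract(mix, num_speakers=2)
        print([e.shape for e in estimates])

The key point the sketch illustrates is that no pre-collected enrollment audio is needed: the reference speech for each speaker is cropped from the conversation itself using the diarization output, and the TSVAD decisions additionally gate the separated streams.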