A speaker extraction algorithm extracts the target speech from mixture speech containing interfering speech and background noise. The extraction process sometimes over-suppresses the extracted target speech, which not only creates audible artifacts but also degrades the performance of downstream automatic speech recognition algorithms. We propose a hybrid continuity loss function for time-domain speaker extraction algorithms to address the over-suppression problem. On top of the waveform-level SI-SDR loss used for signal quality, we introduce a multi-resolution delta spectrum loss in the frequency domain to ensure the continuity of the extracted speech signal, thus alleviating the over-suppression. We examine the hybrid continuity loss function using a time-domain audio-visual speaker extraction algorithm on the YouTube LRS2-BBC dataset. Experimental results show that the proposed loss function reduces over-suppression and improves the speech recognition word error rate on both clean and noisy two-speaker mixtures, without degrading the reconstructed speech quality.
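To make the hybrid continuity loss concrete, the sketch below shows one plausible PyTorch formulation: a negated SI-SDR term on the waveform plus a multi-resolution loss on the frame-to-frame difference (delta) of STFT magnitudes. The function names, FFT sizes, hop lengths, L1 distance on the deltas, and the weighting `alpha` are illustrative assumptions, not the exact definition used in the paper.

```python
import torch

def si_sdr_loss(est, ref, eps=1e-8):
    # Negated scale-invariant SDR on the waveform (lower is better).
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (torch.sum(est * ref, dim=-1, keepdim=True) /
            (torch.sum(ref * ref, dim=-1, keepdim=True) + eps)) * ref
    noise = est - proj
    ratio = torch.sum(proj ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps)
    return -10.0 * torch.log10(ratio + eps).mean()

def delta_spectrum_loss(est, ref, n_ffts=(256, 512, 1024)):
    # Multi-resolution loss on the temporal delta of magnitude spectra,
    # penalizing frame-to-frame discontinuities of the extracted speech.
    loss = 0.0
    for n_fft in n_ffts:
        hop = n_fft // 4
        win = torch.hann_window(n_fft, device=est.device)
        E = torch.stft(est, n_fft, hop_length=hop, window=win,
                       return_complex=True).abs()
        R = torch.stft(ref, n_fft, hop_length=hop, window=win,
                       return_complex=True).abs()
        dE = E[..., 1:] - E[..., :-1]  # delta along the time-frame axis
        dR = R[..., 1:] - R[..., :-1]
        loss = loss + torch.mean(torch.abs(dE - dR))
    return loss / len(n_ffts)

def hybrid_continuity_loss(est, ref, alpha=0.5):
    # alpha is a hypothetical weight balancing the two terms.
    return si_sdr_loss(est, ref) + alpha * delta_spectrum_loss(est, ref)
```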