Many recent advances in speech separation are aimed primarily at synthetic mixtures of short audio utterances with high degrees of overlap. These datasets differ significantly from real conversational data, and hence models trained and evaluated on them do not generalize to real conversational scenarios. Another issue with using most of these models for long-form speech is the nondeterministic ordering of the separated speech segments, caused either by unsupervised clustering of time-frequency masks or by the Permutation Invariant Training (PIT) loss. This makes it difficult to accurately stitch together homogeneous speaker segments for downstream tasks such as Automatic Speech Recognition (ASR). In this paper, we propose a speaker-conditioned separator trained on speaker embeddings extracted directly from the mixed signal. We train this model with a directed loss that regulates the order of the separated segments. With this model, we achieve significant improvements in word error rate (WER) on real conversational data without the need for an additional re-stitching step.
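As a minimal illustrative sketch (not the paper's implementation), the snippet below contrasts a standard utterance-level PIT loss, where the assignment of output channels to speakers is chosen per utterance and is therefore nondeterministic, with a fixed-order "directed" loss in which each output channel is always scored against the speaker whose embedding conditions it. The SI-SNR objective, tensor shapes, and the function names si_snr, pit_loss, and directed_loss are assumptions made for this example.

```python
# Illustrative sketch: PIT loss (order chosen per utterance) vs. a fixed-order
# ("directed") loss tied to the conditioning speaker. Assumed shapes:
# est, ref are (batch, num_spk, time) waveform tensors.
from itertools import permutations
import torch


def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR between estimated and reference signals, per channel."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)


def pit_loss(est, ref):
    """Minimum negative SI-SNR over all speaker permutations.
    The best permutation can differ from utterance to utterance, so the
    output-to-speaker assignment is nondeterministic."""
    n_spk = est.size(1)
    per_perm = []
    for perm in permutations(range(n_spk)):
        per_perm.append(-si_snr(est[:, list(perm)], ref).mean(dim=1))
    return torch.stack(per_perm, dim=1).min(dim=1).values.mean()


def directed_loss(est, ref):
    """Output channel i is always compared to reference speaker i (the speaker
    whose embedding conditioned that channel), so segment order is fixed."""
    return (-si_snr(est, ref)).mean()
```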