Many recent advances in speech separation are aimed primarily at synthetic mixtures of short audio utterances with high degrees of overlap. Most of these approaches require an additional stitching step to assemble the separated speech chunks into long-form audio. Since most of these approaches rely on Permutation Invariant Training (PIT), the order of the separated speech chunks is nondeterministic, which makes it difficult to accurately stitch together chunks from the same speaker for downstream tasks such as Automatic Speech Recognition (ASR). Moreover, most of these models are trained on synthetic mixtures and do not generalize to real conversational data. In this paper, we propose a speaker-conditioned separator trained on speaker embeddings extracted directly from the mixed signal using an over-clustering-based approach. This model naturally regulates the order of the separated chunks without the need for an additional stitching step. We also introduce a data sampling strategy that combines real and synthetic mixtures and generalizes well to real conversational speech. With this model and data sampling technique, we show significant improvements in speaker-attributed word error rate (SA-WER) on Hub5 data.
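To make the over-clustering idea concrete, the sketch below shows one plausible way to derive per-speaker conditioning vectors from a mixture: cluster frame-level speaker embeddings with more clusters than expected speakers, then merge highly similar centroids. This is an illustrative assumption, not the paper's exact recipe; the embedding extractor, cluster count, and merge heuristic are placeholders.

```python
# Hypothetical sketch of over-clustering-based speaker conditioning.
# frame_embeddings are speaker embeddings computed on short windows of the
# *mixed* signal (e.g. from a pretrained speaker encoder). The cluster count
# and merge threshold here are illustrative, not the paper's configuration.
import numpy as np
from sklearn.cluster import KMeans

def conditioning_embeddings(frame_embeddings: np.ndarray,
                            num_clusters: int = 8,
                            merge_threshold: float = 0.6) -> np.ndarray:
    """frame_embeddings: (num_frames, dim) array.
    Returns (num_speakers, dim) conditioning centroids."""
    # Over-cluster: use more clusters than the expected number of speakers
    # so frames dominated by different speakers are not forced together.
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    kmeans.fit(frame_embeddings)
    centroids = kmeans.cluster_centers_

    # Greedily merge clusters whose centroids are highly similar (cosine),
    # collapsing the over-clustered solution to roughly one centroid per speaker.
    merged: list[np.ndarray] = []
    for c in centroids:
        c_norm = c / np.linalg.norm(c)
        for i, m in enumerate(merged):
            m_norm = m / np.linalg.norm(m)
            if float(c_norm @ m_norm) > merge_threshold:
                merged[i] = (m + c) / 2.0  # simple running average of centroids
                break
        else:
            merged.append(c.copy())
    return np.stack(merged)
```

Each returned centroid could then condition the separator on a fixed speaker identity, which is what ties the output order to speakers and removes the need for a separate stitching step.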