The performance of automatic speech recognition (ASR) systems degrades severely when speech from multiple talkers overlaps. In meeting environments, speech separation is typically performed to improve the robustness of ASR systems. Recently, location-based training (LBT) was proposed as a new training criterion for multi-channel talker-independent speaker separation. Assuming a fixed array geometry, LBT outperforms the widely used permutation-invariant training (PIT) on fully overlapped utterances in matched reverberant conditions. This paper extends LBT to conversational multi-channel speaker separation. We introduce multi-resolution LBT, which estimates complex spectrograms from low to high time and frequency resolutions. With multi-resolution LBT, convolutional kernels are assigned consistently based on speaker locations in physical space. Evaluation results show that multi-resolution LBT consistently outperforms other competitive methods on the recorded LibriCSS corpus.
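The difference between the two training criteria named above can be illustrated with a minimal sketch. It is not the implementation used in this work. It assumes that the location cue is the speaker azimuth relative to the array, that the loss is a plain mean squared error, and that the function names `pit_loss` and `lbt_loss` are hypothetical. Permutation-invariant training searches over all output-to-speaker assignments, whereas a location-based criterion fixes the assignment by sorting speakers by location.

```python
from itertools import permutations

import numpy as np


def mse(est, ref):
    """Mean squared error; np.abs keeps it valid for complex spectrograms."""
    return float(np.mean(np.abs(est - ref) ** 2))


def pit_loss(estimates, targets):
    """Permutation-invariant loss: best average error over all assignments."""
    n = len(targets)
    return min(
        float(np.mean([mse(estimates[i], targets[p[i]]) for i in range(n)]))
        for p in permutations(range(n))
    )


def lbt_loss(estimates, targets, azimuths_deg):
    """Location-based loss (illustrative assumption: order by azimuth).

    Output i is always compared with the speaker that has the i-th
    smallest azimuth, so no permutation search is needed.
    """
    order = np.argsort(azimuths_deg)
    n = len(targets)
    return float(np.mean([mse(estimates[i], targets[order[i]]) for i in range(n)]))
```

With a fixed assignment, an estimate that contains the right signals in the wrong output order is penalized by `lbt_loss` but not by `pit_loss`, which is what ties each output to a region of physical space.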