While permutation invariant training (PIT) based continuous speech separation (CSS) significantly improves conversation transcription accuracy, it often suffers from speech leakage and separation failures in "hot spot" regions because it has a fixed number of output channels. In this paper, we propose to apply the recurrent selective attention network (RSAN) to CSS, which generates a variable number of output channels based on active speaker counting. In addition, we propose a novel block-wise dependency extension of RSAN that introduces dependencies between adjacent processing blocks in the CSS framework, enabling the network to exploit the separation results of previous blocks to facilitate processing of the current block. Experimental results on the LibriCSS dataset show that the RSAN-based CSS (RSAN-CSS) network consistently improves speech recognition accuracy over PIT-based models. The proposed block-wise dependency modeling further boosts the performance of RSAN-CSS.
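To make the block-wise dependency idea concrete, the following is a minimal sketch, not the authors' implementation: `rsan_extract` is a hypothetical stand-in for one trained RSAN extraction pass, and all names and shapes are illustrative assumptions. The loop shows how each block emits a variable number of sources (driven by a per-block speaker count) while receiving the previous block's separation results as additional context.

```python
import numpy as np

def rsan_extract(block, residual_mask, prev_sources):
    """Hypothetical placeholder for one RSAN extraction pass.
    A real RSAN is a trained neural network; this stub only
    illustrates the interface. prev_sources (the previous block's
    outputs) is ignored here but would condition a real network."""
    source = block * residual_mask                     # pretend separation
    residual_mask = np.clip(residual_mask - 0.5, 0.0, 1.0)
    return source, residual_mask

def css_blockwise(blocks, speaker_counts):
    """Block-wise CSS loop with cross-block dependency: each block is
    separated into a variable number of sources, and the sources from
    the previous block are passed along so the current block can
    reuse them."""
    prev_sources, outputs = [], []
    for block, n_speakers in zip(blocks, speaker_counts):
        residual_mask = np.ones_like(block)
        sources = []
        for _ in range(n_speakers):                    # variable output channels
            src, residual_mask = rsan_extract(block, residual_mask, prev_sources)
            sources.append(src)
        outputs.append(sources)
        prev_sources = sources                         # dependency carried forward
    return outputs

# Toy usage: three signal blocks with 2, 1, and 3 active speakers.
blocks = [np.random.randn(100) for _ in range(3)]
separated = css_blockwise(blocks, speaker_counts=[2, 1, 3])
```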