Continuous speech separation plays a vital role in complicated speech-related tasks such as conversation transcription. The separation model extracts a single-speaker signal from mixed speech. In this paper, we use the transformer and conformer in lieu of recurrent neural networks in the separation system, as we believe that capturing global information with a self-attention-based method is crucial for speech separation. Evaluated on the LibriCSS dataset, the conformer separation model achieves state-of-the-art results, with a relative 23.5% word error rate (WER) reduction from the bi-directional LSTM (BLSTM) baseline in the utterance-wise evaluation and a 15.4% WER reduction in the continuous evaluation.
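To make the architectural choice concrete, below is a minimal sketch of a conformer-style block of the kind the abstract refers to: a self-attention module that lets every frame attend to the entire segment (the "global information" an LSTM must accumulate sequentially), sandwiched between convolution and feed-forward modules for local modeling. This is an illustrative PyTorch sketch under assumed hyperparameters (dim, heads, kernel size), not the paper's actual implementation.

```python
# Illustrative conformer-style block in PyTorch. Module names and
# hyperparameters are assumptions for this sketch, not the paper's code.
import torch
import torch.nn as nn


class ConvModule(nn.Module):
    """Convolution module: pointwise conv + GLU, depthwise conv, pointwise conv."""

    def __init__(self, dim, kernel):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, 1)
        self.glu = nn.GLU(dim=1)
        # Depthwise conv captures local context; odd kernel keeps length.
        self.depthwise = nn.Conv1d(dim, dim, kernel,
                                   padding=kernel // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()
        self.pointwise2 = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                      # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)       # -> (batch, dim, time)
        y = self.glu(self.pointwise1(y))
        y = self.act(self.bn(self.depthwise(y)))
        y = self.pointwise2(y).transpose(1, 2)
        return x + y                           # residual connection


class ConformerBlock(nn.Module):
    """Half-step FFN -> self-attention -> conv module -> half-step FFN."""

    def __init__(self, dim=256, heads=4, kernel=33, ff_mult=4):
        super().__init__()

        def ff():
            return nn.Sequential(
                nn.LayerNorm(dim), nn.Linear(dim, ff_mult * dim),
                nn.SiLU(), nn.Linear(ff_mult * dim, dim))

        self.ff1, self.ff2 = ff(), ff()
        self.attn_norm = nn.LayerNorm(dim)
        # Self-attention gives every frame a direct path to all other
        # frames, i.e. global context in a single layer.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim, kernel)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = self.conv(x)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)


# Usage: a batch of 2 segments, 100 frames, 256-dim features.
x = torch.randn(2, 100, 256)
print(ConformerBlock()(x).shape)               # torch.Size([2, 100, 256])
```

In a separation system, a stack of such blocks would replace the BLSTM layers that map the mixture's feature sequence to per-speaker masks; the rest of the pipeline is unchanged.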