Speech separation has been successfully applied as a frontend processing module of conversation transcription systems thanks to its ability to handle overlapped speech and its flexibility to combine with downstream tasks such as automatic speech recognition (ASR). However, a speech separation model often introduces target speech distortion, resulting in a suboptimal word error rate (WER). In this paper, we describe our efforts to improve the performance of a single-channel speech separation system. Specifically, we investigate a two-stage training scheme that first applies a feature-level optimization criterion for pretraining, followed by an ASR-oriented optimization criterion using an end-to-end (E2E) speech recognition model. In addition, to keep the model lightweight, we introduce a modified teacher-student learning technique for model compression. By combining these approaches, we achieve absolute average WER improvements of 2.70% and 0.77% using models with fewer than 10M parameters, compared with the previous state-of-the-art results on the LibriCSS dataset for utterance-wise evaluation and continuous evaluation, respectively.
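The abstract does not spell out the training recipe, so the following PyTorch sketch illustrates one plausible reading of the two-stage scheme: feature-level pretraining of the separator, then ASR-oriented fine-tuning through a frozen E2E recognizer. The tiny `SeparatorNet`, the SI-SNR criterion, the dummy CTC recognizer, and all hyperparameters are illustrative assumptions, not the paper's actual models.

```python
import torch
import torch.nn as nn

# Tiny stand-in separator; the paper's actual architecture differs.
class SeparatorNet(nn.Module):
    def __init__(self, n_src=2):
        super().__init__()
        self.net = nn.Conv1d(1, n_src, kernel_size=3, padding=1)

    def forward(self, mix):                # mix: (B, T)
        return self.net(mix.unsqueeze(1))  # (B, n_src, T)

def neg_si_snr(est, ref, eps=1e-8):
    """Negative scale-invariant SNR, a common feature-level criterion."""
    est = est - est.mean(-1, keepdim=True)
    ref = ref - ref.mean(-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    snr = 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -snr.mean()

separator = SeparatorNet()
opt = torch.optim.Adam(separator.parameters(), lr=1e-3)

# Stage 1: feature-level pretraining on (mixture, clean sources) pairs.
mix, srcs = torch.randn(4, 16000), torch.randn(4, 2, 16000)  # dummy batch
loss = neg_si_snr(separator(mix), srcs)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: ASR-oriented fine-tuning. The recognizer below is a dummy
# stand-in for a pretrained E2E ASR model; its parameters are frozen,
# so gradients flow through it back into the separator only.
asr = nn.Sequential(nn.Conv1d(1, 32, 160, stride=160), nn.ReLU(),
                    nn.Conv1d(32, 29, 1))  # 29-way CTC vocabulary (assumed)
for p in asr.parameters():
    p.requires_grad_(False)
ctc = nn.CTCLoss(blank=0)

est = separator(mix)                                          # (B, 2, T)
logp = asr(est.reshape(-1, 1, est.size(-1))).log_softmax(1)   # (B*2, 29, T')
logp = logp.permute(2, 0, 1)                                  # (T', B*2, 29)
targets = torch.randint(1, 29, (8, 20))                       # dummy transcripts
in_lens = torch.full((8,), logp.size(0), dtype=torch.long)
tgt_lens = torch.full((8,), 20, dtype=torch.long)
loss = ctc(logp, targets, in_lens, tgt_lens)
opt.zero_grad(); loss.backward(); opt.step()
```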
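For the compression step, here is a minimal teacher-student sketch reusing `SeparatorNet` from the sketch above. The abstract only says the technique is a modified form of teacher-student learning; interpolating the frozen teacher's outputs with the ground-truth sources via an assumed weight `alpha` is one common variant, shown purely for illustration.

```python
import copy
import torch
import torch.nn.functional as F

# Frozen teacher = the large separator trained above; in the paper the
# student is a smaller network with fewer than 10M parameters.
teacher = copy.deepcopy(separator).eval()
student = SeparatorNet()  # a smaller variant in practice
s_opt = torch.optim.Adam(student.parameters(), lr=1e-3)

mix, srcs = torch.randn(4, 16000), torch.randn(4, 2, 16000)  # dummy batch
with torch.no_grad():
    t_out = teacher(mix)
s_out = student(mix)

# Soft targets: blend teacher outputs with the clean sources (assumed).
alpha = 0.5
loss = F.mse_loss(s_out, alpha * t_out + (1 - alpha) * srcs)
s_opt.zero_grad(); loss.backward(); s_opt.step()
```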