In this paper, we explore an improved framework for training a monaural neural speech enhancement model for robust speech recognition. The proposed training framework extends the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data. We find that the unpaired clean speech is crucial for improving the quality of speech separated from real noisy recordings. The proposed method also remixes processed and unprocessed signals to alleviate processing artifacts. Experiments on the single-channel CHiME-3 real test sets show that the proposed method yields significant speech recognition improvements over enhancement systems trained either on mismatched simulated data in a supervised fashion or on matched real data in an unsupervised fashion. A 16% to 39% relative WER reduction is achieved by the proposed system over the unprocessed signal, using end-to-end and hybrid acoustic models without retraining on distorted data.
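The mixture invariant training criterion referenced above can be illustrated with a minimal sketch: two mixtures are summed, the model outputs several source estimates, and the loss is the best remix of those estimates back into the two original mixtures over all binary assignments. This is a toy NumPy illustration of the generic MixIT idea only; the paper's actual model, loss weighting, and use of unpaired clean speech are not reproduced here.

```python
# Toy MixIT-style loss sketch (assumption: plain NumPy on 1-D toy signals;
# this illustrates the generic criterion, not the paper's exact training setup).
import itertools
import numpy as np

def mixit_loss(mix1, mix2, est_sources):
    """Mixture invariant loss: assign each estimated source to one of the
    two input mixtures and score the best resulting remix against them."""
    refs = np.stack([mix1, mix2])            # (2, T) reference mixtures
    m = est_sources.shape[0]                 # number of estimated sources
    best = np.inf
    # Each source goes to mixture 0 or 1 -> 2**m binary assignment matrices.
    for assign in itertools.product([0, 1], repeat=m):
        A = np.zeros((2, m))
        A[list(assign), range(m)] = 1.0      # one-hot column per source
        remix = A @ est_sources              # (2, T) remixed estimates
        err = np.mean((refs - remix) ** 2)
        best = min(best, err)
    return best

# Toy check: treat each source as its own "mixture"; perfect source
# estimates can be remixed exactly, so the loss should be ~0.
s1, s2 = np.ones(4), -np.ones(4)
loss = mixit_loss(s1, s2, np.stack([s1, s2]))
```

Because the loss minimizes over assignments, the separator never needs isolated ground-truth sources, which is what lets the criterion exploit real noisy data without paired clean references.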