Compensation for channel mismatch and noise interference is essential for robust automatic speech recognition. Enhanced speech has been introduced into the multi-condition training of acoustic models to improve their generalization ability. In this paper, a noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition. The feature enhancement module is a multi-task autoencoder that decomposes noisy speech into clean speech and noise. By concatenating the enhanced, noise-aware, and noisy features of each frame, the acoustic-modeling module maps each feature-augmented frame to a triphone state by optimizing the lattice-free maximum mutual information and cross entropy between the predicted and actual state sequences. On top of the factorized time delay neural network (TDNN-F) and its convolutional variant (CNN-TDNNF), both with SpecAug, the two proposed systems achieve word error rates (WERs) of 3.90% and 3.55%, respectively, on the Aurora-4 task. Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves relative WER reductions of 15.20% and 33.53%, respectively. In addition, the proposed CNN-TDNNF-based system also outperforms the baseline CNN-TDNNF system on the AMI task.
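To make the cascaded structure concrete, the following is a minimal sketch, assuming a PyTorch implementation: a multi-task autoencoder with a shared encoder and two decoders (clean-speech and noise estimates), whose outputs are concatenated with the original noisy features before being fed to the acoustic model. The layer sizes, module names, and use of plain feed-forward layers here are illustrative assumptions, not the paper's exact TDNN-F/CNN-TDNNF configuration or its LF-MMI training recipe.

import torch
import torch.nn as nn

class MultiTaskAutoencoder(nn.Module):
    """Shared encoder with two decoders: one reconstructs clean speech,
    the other estimates the noise component of each noisy frame."""
    def __init__(self, feat_dim=40, hidden_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.clean_decoder = nn.Linear(hidden_dim, feat_dim)  # enhanced speech
        self.noise_decoder = nn.Linear(hidden_dim, feat_dim)  # noise estimate

    def forward(self, noisy):
        h = self.encoder(noisy)
        return self.clean_decoder(h), self.noise_decoder(h)

class NoiseAwareAcousticModel(nn.Module):
    """Maps the feature-augmented frame [enhanced; noise; noisy] to
    triphone-state logits (a feed-forward stand-in for the TDNN-F stack)."""
    def __init__(self, feat_dim=40, hidden_dim=1024, num_states=2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_states),
        )

    def forward(self, enhanced, noise, noisy):
        return self.net(torch.cat([enhanced, noise, noisy], dim=-1))

# Joint forward pass over a batch of frames (batch, feat_dim); in joint
# training, gradients from the acoustic-model loss also flow into the
# autoencoder through the enhanced and noise features.
noisy = torch.randn(8, 40)
mtae = MultiTaskAutoencoder()
am = NoiseAwareAcousticModel()
enhanced, noise = mtae(noisy)
state_logits = am(enhanced, noise, noisy)  # shape: (8, num_states)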