This study addresses robust automatic speech recognition (ASR) by introducing a Conformer-based acoustic model. The proposed model builds on the wide residual bi-directional long short-term memory network (WRBN) with utterance-wise dropout and iterative speaker adaptation, but replaces the recurrent network with a Conformer encoder, which uses a convolution-augmented attention mechanism for acoustic modeling. The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus. Coupled with utterance-wise normalization and speaker adaptation, our model achieves a word error rate (WER) of $6.25\%$, a relative improvement of $8.4\%$ over WRBN. In addition, the proposed Conformer-based model is $18.3\%$ smaller than WRBN and reduces total training time by $79.6\%$.
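The relative figures quoted above follow the usual definition of relative reduction, (baseline - proposed) / baseline. The short sketch below illustrates that arithmetic. The function names are hypothetical, and the WRBN baseline WER it prints is back-calculated from the two numbers in the abstract ($6.25\%$ and $8.4\%$); it is not a value stated here, and rounding in the reported figures means the published baseline may differ slightly.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Relative reduction of `proposed` with respect to `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline


def implied_baseline(proposed: float, rel_reduction: float) -> float:
    """Baseline value implied by a proposed value and a relative reduction."""
    if not 0.0 <= rel_reduction < 1.0:
        raise ValueError("rel_reduction must be in [0, 1)")
    return proposed / (1.0 - rel_reduction)


# Figures taken from the abstract.
conformer_wer = 6.25   # WER in percent
reported_rel = 0.084   # 8.4% relative improvement over WRBN

# Assumption: this baseline is inferred, not reported in the abstract.
wrbn_wer_implied = implied_baseline(conformer_wer, reported_rel)

print(f"Implied WRBN WER: {wrbn_wer_implied:.2f}%")
print(f"Round-trip relative reduction: "
      f"{relative_reduction(wrbn_wer_implied, conformer_wer):.1%}")
```

The same `relative_reduction` helper applies to the model-size and training-time comparisons, given the absolute values for both systems.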