End-to-end Automatic Speech Recognition (ASR) systems based on neural networks have seen large improvements in recent years. The availability of large-scale hand-labeled datasets and sufficient computing resources has made it possible to train powerful deep neural networks that reach very low Word Error Rates (WER) on academic benchmarks. However, despite impressive performance on clean audio samples, a drop in performance is often observed on noisy speech. In this work, we propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification (CTC)-based architecture by processing both the audio and visual modalities. We improve previous lip-reading methods by using an Efficient Conformer back-end on top of a ResNet-18 visual front-end and by adding intermediate CTC losses between blocks. We condition intermediate block features on early predictions using Inter CTC residual modules to relax the conditional independence assumption of CTC-based models. We also replace the Efficient Conformer grouped attention with a more efficient and simpler attention mechanism that we call patch attention. We experiment with the publicly available Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3) datasets. Our experiments show that using both audio and visual modalities enables better speech recognition in the presence of environmental noise and significantly accelerates training, reaching lower WER with 4 times fewer training steps. Our Audio-Visual Efficient Conformer (AVEC) model achieves state-of-the-art performance, with WERs of 2.3% and 1.8% on the LRS2 and LRS3 test sets. Code and pretrained models are available at https://github.com/burchim/AVEC.
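The abstract's Inter CTC residual module can be pictured as a small feedback path between encoder blocks: intermediate features are projected to token logits for an auxiliary CTC loss, and the resulting frame-level predictions are projected back and added to the features so that later blocks are conditioned on them. The following is a minimal PyTorch sketch of that idea, not the authors' implementation; class and variable names are illustrative.

```python
# Minimal sketch of an Inter CTC residual module (self-conditioned CTC).
# Intermediate encoder features are projected to vocabulary logits for an
# auxiliary CTC loss, and the softmax predictions are projected back into
# the feature space and added residually, so subsequent blocks are
# conditioned on these early predictions.
import torch
import torch.nn as nn


class InterCTCResidualModule(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj_out = nn.Linear(d_model, vocab_size)  # features -> token logits
        self.proj_in = nn.Linear(vocab_size, d_model)   # predictions -> features

    def forward(self, x: torch.Tensor):
        """x: (batch, time, d_model) intermediate encoder features."""
        logits = self.proj_out(x)           # intermediate CTC logits
        probs = logits.softmax(dim=-1)      # early frame-level predictions
        x = x + self.proj_in(probs)         # condition following blocks on them
        return x, logits                    # logits feed an auxiliary CTC loss
```

During training, the returned `logits` (after `log_softmax`) would be passed to `torch.nn.CTCLoss` and the resulting auxiliary losses summed with the final-layer CTC loss; at inference only the final layer's predictions are decoded.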