This paper focuses on designing a noise-robust end-to-end Audio-Visual Speech Recognition (AVSR) system. To this end, we propose the Visual Context-driven Audio Feature Enhancement module (V-CAFE), which enhances noisy input audio with the aid of audio-visual correspondence. The proposed V-CAFE is designed to capture the transition of lip movements, namely the visual context, and to generate a noise reduction mask conditioned on the obtained visual context. Through this context-dependent modeling, the ambiguity of viseme-to-phoneme mapping can be resolved during mask generation. The noisy audio representations are masked with the noise reduction mask, yielding enhanced audio features. The enhanced audio features are fused with the visual features and fed into an encoder-decoder model, composed of a Conformer encoder and a Transformer decoder, for speech recognition. We show that the proposed end-to-end AVSR system with V-CAFE further improves the noise robustness of AVSR. The effectiveness of the proposed method is evaluated in noisy speech recognition and overlapped speech recognition experiments using the two largest audio-visual datasets, LRS2 and LRS3.
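To make the data flow concrete, below is a minimal PyTorch sketch of the V-CAFE idea as described above: visual features are convolved over time to capture the transition of lip movements, the resulting visual context conditions a noise reduction mask applied to the noisy audio features, and the enhanced audio is fused with the visual stream before the encoder-decoder. All module names, dimensions, and layer choices here are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch only; layer sizes and choices are assumptions.
import torch
import torch.nn as nn

class VCAFE(nn.Module):
    """Visual Context-driven Audio Feature Enhancement (conceptual sketch)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        # Temporal convolution over visual features to capture the
        # transition of lip movements, i.e., the visual context.
        self.visual_context = nn.Conv1d(dim, dim, kernel_size=5, padding=2)
        # Mask generator conditioned on both the visual context and the
        # noisy audio features; outputs a per-element mask in [0, 1].
        self.mask_gen = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
            nn.Sigmoid(),
        )
        # Fusion of enhanced audio and visual features for the
        # downstream Conformer encoder / Transformer decoder.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, time, dim), assumed temporally aligned.
        ctx = self.visual_context(visual.transpose(1, 2)).transpose(1, 2)
        mask = self.mask_gen(torch.cat([audio, ctx], dim=-1))
        enhanced = audio * mask  # mask out noise-dominated components
        return self.fuse(torch.cat([enhanced, visual], dim=-1))
```

Conditioning the mask on a window of visual frames, rather than a single frame, is what lets context disambiguate visemes that map to multiple phonemes.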