Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, which is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, in which the agent dynamically harmonizes modality-invariant and modality-specific representations during the auto-regressive decoding process. We customize a reward function directly related to the task-specific metric (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that MSRL generalizes better than other baselines when the test set contains unseen noise types.
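As an informal illustration of the two ingredients described above, the sketch below shows (i) a reward tied directly to word error rate and (ii) a single decoding step in which a learned policy weighs the modality-invariant features against the visual modality-specific features. The names `policy_net` and `fuse_step`, and the sigmoid-gated mixing, are illustrative assumptions, not the authors' released implementation.

```python
# A minimal, hypothetical sketch of a WER-based reward and a gated fusion step.
# `policy_net` and the sigmoid gate are assumptions for illustration only.
import torch

def wer_reward(hyp_words, ref_words):
    """Negative word error rate: higher reward for fewer word errors."""
    # Standard dynamic-programming edit distance over word sequences.
    prev = list(range(len(hyp_words) + 1))
    for i, ref in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp in enumerate(hyp_words, start=1):
            cost = 0 if ref == hyp else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return -prev[-1] / max(len(ref_words), 1)

def fuse_step(policy_net, invariant_feat, specific_feat):
    """One auto-regressive decoding step: the agent decides how much to
    trust the modality-invariant features versus the visual
    modality-specific features before predicting the next token."""
    state = torch.cat([invariant_feat, specific_feat], dim=-1)
    gate = torch.sigmoid(policy_net(state))    # action in [0, 1]
    return gate * invariant_feat + (1 - gate) * specific_feat
```

In this reading, the reward is computed only after a full hypothesis is decoded, so the gate values chosen at each step are credited according to the resulting word error rate; this is one plausible way to realize the metric-driven exploration described in the abstract.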