Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth. Unlike works that focus solely on lip motion, we investigate the contribution of entire visual frames (visual actions, objects, background, etc.). This is particularly useful for unconstrained videos, where the speaker is not necessarily visible. To solve this task, we propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) which is trained end-to-end from spectrograms and full-frame RGB. To prevent the audio stream from dominating training, we propose different word-masking strategies, thereby encouraging our model to pay attention to the visual stream. We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise, and show that our model outperforms all prior work by a large margin. Finally, we also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
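To illustrate the word-masking idea described above, the following is a minimal sketch of one plausible implementation, not necessarily the paper's exact recipe: given word-level time alignments, the spectrogram frames of randomly selected words are zeroed out, so the model can only recover those words by attending to the visual stream. The function name, alignment format, masking probability, and frame rate below are assumptions for illustration.

```python
import random
import numpy as np

def mask_words_in_spectrogram(spec, word_spans, mask_prob=0.15, frames_per_sec=100):
    """Zero out the spectrogram frames of randomly chosen words.

    spec:           (num_frames, num_mel_bins) log-mel spectrogram
    word_spans:     list of (start_sec, end_sec) word-level alignments
    mask_prob:      probability of masking each word (hypothetical value)
    frames_per_sec: spectrogram frame rate (hypothetical value)
    """
    spec = spec.copy()
    for start_sec, end_sec in word_spans:
        if random.random() < mask_prob:
            start = int(start_sec * frames_per_sec)
            end = int(end_sec * frames_per_sec)
            # Silence this word in the audio stream; the visual frames
            # remain intact, so the model must look at them to transcribe it.
            spec[start:end, :] = 0.0
    return spec
```

Applied as a training-time augmentation, this kind of masking weakens the audio signal for a subset of words while leaving the target transcript unchanged, which is the mechanism that encourages the model to exploit the visual modality.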