Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training fully supervised multimodal models for this task from scratch, however, is limited by the need for large labelled audiovisual datasets (in each downstream domain of interest). We present AVFormer, a simple method for augmenting audio-only models with visual information while simultaneously performing lightweight domain adaptation. We do this by (i) injecting visual embeddings into a frozen ASR model using lightweight trainable adaptors. We show that these can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters. (ii) We also introduce a simple curriculum scheme during training, which we show is crucial for enabling the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state-of-the-art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on the traditional audio-only speech recognition benchmark LibriSpeech. Qualitative results show that our model effectively leverages visual information for robust speech recognition.
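To make the adapter idea in (i) concrete, below is a minimal sketch of injecting projected visual features into a frozen audio encoder while training only the new projection. The wrapper class, parameter names, dimensions, and encoder interface are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class VisualAdapterASR(nn.Module):
    """Illustrative wrapper (hypothetical, not AVFormer's exact design):
    a frozen audio encoder plus a lightweight trainable projection that
    maps visual embeddings into the encoder's token space and prepends
    them to the audio token sequence."""

    def __init__(self, frozen_encoder: nn.Module, visual_dim: int, model_dim: int):
        super().__init__()
        self.encoder = frozen_encoder
        # Freeze every parameter of the pretrained ASR backbone.
        for p in self.encoder.parameters():
            p.requires_grad = False
        # The only trainable component: visual features -> audio token space.
        self.visual_proj = nn.Linear(visual_dim, model_dim)

    def forward(self, audio_tokens: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (batch, T_audio, model_dim) audio frame embeddings
        # visual_feats: (batch, T_visual, visual_dim) e.g. CLIP-style frame features
        visual_tokens = self.visual_proj(visual_feats)            # (batch, T_visual, model_dim)
        fused = torch.cat([visual_tokens, audio_tokens], dim=1)   # prepend visual tokens
        # Assumes the frozen encoder accepts a sequence of token embeddings.
        return self.encoder(fused)

# Usage sketch: only the projection parameters are optimised,
# so the pretrained ASR weights remain untouched.
# model = VisualAdapterASR(pretrained_asr_encoder, visual_dim=512, model_dim=768)
# optimizer = torch.optim.Adam(model.visual_proj.parameters(), lr=1e-4)
```

Keeping the backbone frozen is what allows the adaptation to be done with few parameters and little data, which is the motivation for points (i) and (ii) in the abstract.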