Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality. In particular, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g., a 3D version of VGG). Recently, image transformer networks (arXiv:2010.11929) demonstrated the ability to extract rich visual features for image classification. In this work, we propose replacing the 3D convolution with a video transformer as the video feature extractor. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. We then evaluate performance on a labeled subset of YouTube as well as on the public LRS3-TED corpus. Our best video-only model achieves 34.9% WER on YTDEV18 and 19.3% WER on LRS3-TED, which are 10% and 9% relative improvements over the convolutional baseline, respectively. After fine-tuning our model, we achieve state-of-the-art audio-visual recognition performance on LRS3-TED (1.6% WER).
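To make the proposed front-end swap concrete, the sketch below shows a minimal ViT-style video feature extractor in the spirit of arXiv:2010.11929, producing per-frame visual features that are concatenated with the audio features before the ASR encoder. This is an illustrative assumption, not the paper's implementation; the class name `VideoTransformerFrontEnd` and all hyperparameters (patch size, model width, depth) are hypothetical.

```python
# Minimal sketch (assumed, not the paper's code): a ViT-style video
# front-end replacing a 3D-convolutional feature extractor for AV-ASR.
import torch
import torch.nn as nn

class VideoTransformerFrontEnd(nn.Module):
    """Encodes a mouth-region video clip into per-frame visual features."""
    def __init__(self, patch=16, img=64, dim=256, depth=6, heads=4):
        super().__init__()
        n_patches = (img // patch) ** 2
        # Per-frame patch embedding, as in ViT (arXiv:2010.11929).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, video):  # video: (batch, time, 3, H, W)
        b, t = video.shape[:2]
        x = video.flatten(0, 1)            # (b*t, 3, H, W)
        x = self.patch_embed(x)            # (b*t, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)   # (b*t, n_patches, dim)
        x = self.encoder(x + self.pos)     # attend over spatial patches
        feats = x.mean(dim=1)              # pool patches into one frame vector
        return feats.view(b, t, -1)        # (batch, time, dim)

# Illustrative fusion: concatenate per-frame visual features with the
# audio features along the channel axis before the ASR encoder.
frontend = VideoTransformerFrontEnd()
video = torch.randn(2, 30, 3, 64, 64)   # 2 clips, 30 frames each
audio = torch.randn(2, 30, 80)          # e.g. 80-dim filterbank features
fused = torch.cat([audio, frontend(video)], dim=-1)  # (2, 30, 336)
```

For simplicity, this sketch applies attention only over the spatial patches of each frame; a full video transformer would typically also attend across time before fusion.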