Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first time, we study the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems with synthetic lip movements. The key idea behind SynthVSR is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech. The speech-driven lip animation model is trained on an unlabeled audio-visual dataset and can be further optimized towards a pre-trained VSR model when labeled videos are available. As plenty of transcribed acoustic data and face images are available, we are able to generate large-scale synthetic data using the proposed lip animation model for semi-supervised VSR training. We evaluate the performance of our approach on the largest public VSR benchmark, Lip Reading Sentences 3 (LRS3). SynthVSR achieves a WER of 43.3% with only 30 hours of real labeled data, outperforming off-the-shelf approaches that use thousands of hours of video. The WER is further reduced to 27.9% when using all 438 hours of labeled data from LRS3, on par with the state-of-the-art self-supervised AV-HuBERT method. Furthermore, when combined with large-scale pseudo-labeled audio-visual data, SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing the recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours). Finally, we perform extensive ablation studies to understand the effect of each component in our proposed method.
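To make the data-generation idea concrete, the sketch below illustrates the pipeline the abstract describes: pairing transcribed acoustic utterances with face images and synthesizing labeled lip-movement clips for semi-supervised VSR training. This is a minimal sketch, not the paper's implementation; all names here (`LabeledAudio`, `SyntheticClip`, `animate_lips`, `generate_synthetic_corpus`) are hypothetical stand-ins, since the abstract does not specify the actual interfaces of the lip animation model.

```python
from dataclasses import dataclass
from typing import Iterable, List
import random

# Hypothetical data containers; the real SynthVSR formats are not
# specified in the abstract.

@dataclass
class LabeledAudio:
    waveform: List[float]   # transcribed speech signal
    transcript: str         # ground-truth text label

@dataclass
class SyntheticClip:
    frames: List[bytes]     # generated lip-movement video frames
    transcript: str         # label inherited from the source audio

def animate_lips(face_image: bytes, waveform: List[float]) -> List[bytes]:
    """Placeholder for the speech-driven lip animation model: given a
    face image and a speech waveform, return lip-movement frames."""
    raise NotImplementedError

def generate_synthetic_corpus(
    acoustic_data: Iterable[LabeledAudio],
    face_images: List[bytes],
) -> List[SyntheticClip]:
    """Pair each transcribed utterance with a sampled face image and
    synthesize a labeled video clip, yielding synthetic training data
    that can be mixed with real labeled videos for VSR training."""
    corpus = []
    for utterance in acoustic_data:
        face = random.choice(face_images)
        frames = animate_lips(face, utterance.waveform)
        corpus.append(
            SyntheticClip(frames=frames, transcript=utterance.transcript)
        )
    return corpus
```

The design point this sketch captures is that labels come for free: because each synthetic clip inherits the transcript of the acoustic utterance that drove the animation, abundant transcribed audio and unlabeled face images are turned into labeled video at scale.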