Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while publicly available transcribed video datasets remain limited in size. In this paper, for the first time, we study the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems with synthetic lip movements. The key idea behind SynthVSR is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech. The lip animation model is trained on an unlabeled audio-visual dataset and can be further optimized towards a pre-trained VSR model when labeled videos are available. Since transcribed acoustic data and face images are abundant, we are able to generate large-scale synthetic data with the proposed lip animation model for semi-supervised VSR training. We evaluate our approach on the largest public VSR benchmark, Lip Reading Sentences 3 (LRS3). SynthVSR achieves a WER of 43.3% with only 30 hours of real labeled data, outperforming off-the-shelf approaches that use thousands of hours of video. The WER is further reduced to 27.9% when all 438 hours of labeled data from LRS3 are used, which is on par with the state-of-the-art self-supervised AV-HuBERT method. Furthermore, when combined with large-scale pseudo-labeled audio-visual data, SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours). Finally, we perform extensive ablation studies to understand the effect of each component in our proposed method.
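The abstract describes a data-generation pipeline: a speech-driven lip animation model pairs transcribed audio with face images to synthesize labeled lip-movement videos, which are then mixed with real labeled data for semi-supervised VSR training. The following is a minimal conceptual sketch of that generation loop only; all names (`LipAnimationModel`, `LabeledClip`, `generate_synthetic_corpus`) are hypothetical placeholders and do not reflect the authors' actual implementation or API.

```python
# Hypothetical sketch of a SynthVSR-style synthetic data generation loop.
# Assumptions: `audio_corpus` is a list of (audio, transcript) pairs from a
# transcribed speech dataset, and `face_pool` is a collection of face images.

import random
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledClip:
    frames: list        # synthesized video frames (lip-region crops)
    transcript: str     # transcript inherited from the source audio


class LipAnimationModel:
    """Placeholder for a speech-driven lip animation model, trained on
    unlabeled audio-visual data and optionally optimized towards a
    pre-trained VSR model when labeled videos are available."""

    def animate(self, face_image, audio) -> list:
        # Would return frames whose lip movements follow the input speech.
        raise NotImplementedError("stand-in for the actual animation model")


def generate_synthetic_corpus(animator: LipAnimationModel,
                              audio_corpus, face_pool,
                              n_clips: int) -> List[LabeledClip]:
    """Pair transcribed audio with sampled face identities to produce
    labeled synthetic videos for semi-supervised VSR training."""
    clips: List[LabeledClip] = []
    for _ in range(n_clips):
        audio, transcript = random.choice(audio_corpus)
        face = random.choice(face_pool)
        frames = animator.animate(face, audio)
        # The audio's transcript becomes the label of the synthetic video.
        clips.append(LabeledClip(frames=frames, transcript=transcript))
    return clips
```

The synthetic clips produced this way would be combined with the real labeled videos (e.g., the 30- or 438-hour LRS3 subsets mentioned above) when training the VSR model.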