While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled audio-visual data and the cost of deploying one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par with or better than state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio performs well with audio-visual and visual speech input, achieving zero-shot modality generalization for multiple speech processing tasks. In particular, our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input. Code and models are available at https://github.com/facebookresearch/av_hubert
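To illustrate the modality dropout idea mentioned above, below is a minimal sketch, not the official av_hubert implementation: during pre-training, one modality's features are randomly zeroed out so the same encoder learns from audio-visual, audio-only, and video-only input. The function name, the drop probability, and the fusion-by-summation are illustrative assumptions.

```python
# Hypothetical sketch of modality dropout for audio-visual pre-training.
# Assumes audio/video features are frame-aligned tensors of shape (batch, time, dim).
import torch

def modality_dropout(audio_feats, video_feats, p_drop=0.5, training=True):
    """Randomly drop one modality so the model also learns from unimodal input."""
    if training and torch.rand(1).item() < p_drop:
        if torch.rand(1).item() < 0.5:
            audio_feats = torch.zeros_like(audio_feats)  # keep video only
        else:
            video_feats = torch.zeros_like(video_feats)  # keep audio only
    # Fuse by summation (one simple choice, assumed here); the fused features
    # would then feed the Transformer trained with masked cluster prediction.
    return audio_feats + video_feats

# Usage sketch:
# fused = modality_dropout(audio_feats, video_feats, p_drop=0.5, training=model.training)
```

Because the encoder sees unimodal inputs during pre-training, a model fine-tuned on one modality can still be evaluated on the others, which is the zero-shot modality generalization described in the abstract.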