We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech representations. Our pre-training objective involves encoding masked inputs, and then predicting contextualised targets generated by slowly-evolving momentum encoders. Driven by the inherent differences between video and audio, our design is asymmetric w.r.t. the two modalities' pretext tasks: Whereas the auditory stream predicts both the visual and auditory targets, the visual one predicts only the auditory targets. We observe strong results in low- and high-resource labelled data settings when fine-tuning the visual and auditory encoders resulting from a single pre-training stage, in which the encoders are jointly trained. Notably, RAVEn surpasses all self-supervised methods on visual speech recognition (VSR) on LRS3, and combining RAVEn with self-training using only 30 hours of labelled data even outperforms a recent semi-supervised method trained on 90,000 hours of non-public data. At the same time, we achieve state-of-the-art results in the LRS3 low-resource setting for auditory speech recognition (as well as for VSR). Our findings point to the viability of learning powerful speech representations entirely from raw video and audio, i.e., without relying on handcrafted features. Code and models are available at https://github.com/ahaliassos/raven.
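To make the asymmetric masked-prediction objective concrete, here is a minimal PyTorch sketch, assuming encoders that return per-frame features of a common dimension `dim`. The class name, the linear predictors, the EMA momentum value, and the cosine-similarity regression loss are illustrative assumptions for exposition only, not the released implementation; refer to the repository linked above for the actual code.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class RAVEnSketch(nn.Module):
    """Sketch of asymmetric cross-modal masked prediction with momentum teachers (assumed API)."""

    def __init__(self, video_encoder, audio_encoder, dim=512, momentum=0.999):
        super().__init__()
        # Students encode masked inputs and receive gradients.
        self.video_student = video_encoder
        self.audio_student = audio_encoder
        # Teachers are slowly-evolving EMA copies of the students (no gradients).
        self.video_teacher = copy.deepcopy(video_encoder)
        self.audio_teacher = copy.deepcopy(audio_encoder)
        for p in list(self.video_teacher.parameters()) + list(self.audio_teacher.parameters()):
            p.requires_grad = False
        # Lightweight predictors map student features to the target spaces (assumption).
        self.audio_to_audio = nn.Linear(dim, dim)
        self.audio_to_video = nn.Linear(dim, dim)
        self.video_to_audio = nn.Linear(dim, dim)
        self.momentum = momentum

    @torch.no_grad()
    def update_teachers(self):
        # EMA update: teachers track students slowly.
        for student, teacher in [(self.video_student, self.video_teacher),
                                 (self.audio_student, self.audio_teacher)]:
            for ps, pt in zip(student.parameters(), teacher.parameters()):
                pt.data.mul_(self.momentum).add_(ps.data, alpha=1 - self.momentum)

    def forward(self, video_masked, audio_masked, video_full, audio_full):
        # Students encode masked inputs; teachers produce contextualised targets
        # from the unmasked inputs. Features assumed to be (batch, time, dim).
        v_feat = self.video_student(video_masked)
        a_feat = self.audio_student(audio_masked)
        with torch.no_grad():
            v_tgt = self.video_teacher(video_full)
            a_tgt = self.audio_teacher(audio_full)

        def regress(pred, tgt):
            # Negative cosine similarity as a stand-in regression loss (assumption).
            return -F.cosine_similarity(pred, tgt.detach(), dim=-1).mean()

        # Asymmetric pretext tasks: the auditory stream predicts both modalities'
        # targets, while the visual stream predicts only the auditory targets.
        loss_audio = regress(self.audio_to_audio(a_feat), a_tgt) \
                   + regress(self.audio_to_video(a_feat), v_tgt)
        loss_video = regress(self.video_to_audio(v_feat), a_tgt)
        return loss_audio + loss_video
```

After each optimiser step on the combined loss, `update_teachers()` would be called so the momentum encoders evolve slowly, which is what keeps the contextualised targets stable during joint pre-training.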