Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at https://github.com/facebookresearch/av_hubert
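To make the pre-training objective described above concrete, below is a minimal, illustrative Python/PyTorch sketch of masked audio-visual cluster prediction. All names, dimensions, and the simple concatenation-based fusion (ToyAVHuBERT, AUDIO_DIM, pretrain_loss, etc.) are hypothetical simplifications for exposition, not the released implementation; see the repository linked above for the actual model.

    import torch
    import torch.nn as nn

    # Hypothetical dimensions for illustration only.
    AUDIO_DIM, VIDEO_DIM, MODEL_DIM, NUM_CLUSTERS = 104, 512, 768, 500

    class ToyAVHuBERT(nn.Module):
        """Sketch of masked prediction of discrete hidden units from audio-visual input."""
        def __init__(self):
            super().__init__()
            self.audio_proj = nn.Linear(AUDIO_DIM, MODEL_DIM)
            self.video_proj = nn.Linear(VIDEO_DIM, MODEL_DIM)
            # Fuse the two streams frame-by-frame (simplified fusion for the sketch).
            self.fuse = nn.Linear(2 * MODEL_DIM, MODEL_DIM)
            layer = nn.TransformerEncoderLayer(MODEL_DIM, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)
            self.cluster_head = nn.Linear(MODEL_DIM, NUM_CLUSTERS)
            self.mask_emb = nn.Parameter(torch.zeros(MODEL_DIM))

        def forward(self, audio, video, mask):
            # audio: (B, T, AUDIO_DIM), video: (B, T, VIDEO_DIM), mask: (B, T) bool
            a = self.audio_proj(audio)
            v = self.video_proj(video)
            x = self.fuse(torch.cat([a, v], dim=-1))
            # Replace masked frames with a learned mask embedding.
            x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
            return self.cluster_head(self.encoder(x))  # (B, T, NUM_CLUSTERS) logits

    def pretrain_loss(model, audio, video, mask, cluster_targets):
        """Cross-entropy against offline cluster assignments (the 'hidden units')."""
        logits = model(audio, video, mask)
        # Scoring masked frames forces the model to infer a frame's hidden unit
        # from the surrounding audio-visual context.
        return nn.functional.cross_entropy(logits[mask], cluster_targets[mask])

    # Example usage with toy shapes: 2 utterances of 50 frames, ~30% of frames masked.
    model = ToyAVHuBERT()
    audio = torch.randn(2, 50, AUDIO_DIM)
    video = torch.randn(2, 50, VIDEO_DIM)
    mask = torch.rand(2, 50) < 0.3
    targets = torch.randint(0, NUM_CLUSTERS, (2, 50))
    loss = pretrain_loss(model, audio, video, mask, targets)

In the full framework, the cluster targets themselves are what the abstract calls "iteratively refined" hidden units: features extracted from an earlier pre-training round are re-clustered (e.g., with k-means) to produce improved targets for the next round.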