This paper investigates self-supervised pre-training for audio-visual speaker representation learning, where a visual stream showing the speaker's mouth area is used alongside speech as input. Our study focuses on the Audio-Visual Hidden Unit BERT (AV-HuBERT) approach, a recently developed general-purpose audio-visual speech pre-training framework. We conducted extensive experiments probing the effectiveness of pre-training and of the visual modality. Experimental results suggest that AV-HuBERT generalizes decently to speaker-related downstream tasks, improving label efficiency roughly tenfold for both audio-only and audio-visual speaker verification. We also show that incorporating visual information, even just the lip area, greatly improves performance and noise robustness, reducing EER by 38% in clean conditions and by 75% in noisy conditions.
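The equal error rate (EER) quoted above is the standard speaker verification metric: the operating point where the false acceptance rate equals the false rejection rate. As a minimal sketch (not the paper's evaluation code), EER can be computed from verification trial scores and binary same-speaker labels as follows; the `scores` and `labels` arrays here are hypothetical placeholders:

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Estimate EER from binary trial labels (1 = same speaker) and scores."""
    # roc_curve gives false positive rate (false acceptance) and true
    # positive rate; the false rejection rate is 1 - tpr.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    # EER lies where the FAR and FRR curves cross; take the threshold
    # index where the two rates are closest and average them.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

# Hypothetical usage: 4 verification trials with similarity scores.
labels = np.array([1, 1, 0, 0])
scores = np.array([0.9, 0.6, 0.4, 0.1])
print(f"EER: {compute_eer(labels, scores):.2%}")
```

Under this metric, the reported 38% and 75% figures are relative reductions of the EER achieved without the visual stream, in clean and noisy conditions respectively.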