In this work, we present a novel method, named AV2vec, for learning audio-visual speech representations through multimodal self-distillation. AV2vec comprises a student module and a teacher module: the student performs a masked latent feature regression task on multimodal target features generated online by the teacher, whose parameters are updated as a momentum-based moving average of the student's. Because the target features are generated online, AV2vec requires no iterative refinement step as AV-HuBERT does, and its total training time is reduced to less than one-fifth. We further propose AV2vec-MLM, which augments AV2vec with a masked language model (MLM)-style loss via multitask learning. Our experimental results show that AV2vec achieved performance comparable to the AV-HuBERT baseline. When combined with the MLM-style loss, AV2vec-MLM outperformed the baselines and achieved the best performance on the downstream tasks.
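The momentum update of the teacher mentioned above is commonly realized as an exponential moving average (EMA) of the student's parameters. The following is a minimal pure-Python sketch of that update rule; the function name, parameter representation (flat lists of floats), and the decay value are illustrative assumptions, not details from the paper.

```python
def ema_update(teacher_params, student_params, decay=0.999):
    """In-place momentum update: teacher <- decay * teacher + (1 - decay) * student.

    teacher_params, student_params: flat lists of floats standing in for
    model parameters (an illustrative simplification).
    """
    for i, (t, s) in enumerate(zip(teacher_params, student_params)):
        teacher_params[i] = decay * t + (1.0 - decay) * s
    return teacher_params

# Usage: only the student is trained by gradient descent; after each step,
# the teacher is nudged toward the student.
teacher = [0.0, 2.0]
student = [1.0, 2.0]
ema_update(teacher, student, decay=0.9)  # teacher becomes [0.1, 2.0]
```

With a decay close to 1, the teacher changes slowly, which stabilizes the online targets it produces for the student's masked regression task.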