With the advance of self-supervised learning for audio and visual modalities, it has become possible to learn robust audio-visual speech representations. This benefits audio-visual speech recognition (AVSR), since multi-modal inputs in principle carry richer information. In this paper, building on existing self-supervised representation learning methods for the audio modality, we propose an audio-visual representation learning approach. The proposed approach exploits both the complementarity of the audio and visual modalities and long-term context dependency, using a transformer-based fusion module and a flexible masking strategy. After pre-training, the model can extract the fused representations required by AVSR. Without loss of generality, it can also be applied to single-modal tasks, e.g. audio-only or visual-only speech recognition, by simply masking out one modality in the fusion module. The proposed pre-trained model is evaluated on speech recognition and lipreading tasks using one or both modalities, demonstrating its superiority.
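To make the idea of a transformer-based fusion module with a flexible modality-masking strategy concrete, the following is a minimal sketch, not the paper's actual architecture: all module names, feature dimensions, and the zeroing-based masking are illustrative assumptions. It shows how frame-level audio and visual features could be fused by a transformer encoder, and how masking out one stream lets the same module serve single-modal recognition.

```python
import torch
import torch.nn as nn


class AVFusionModule(nn.Module):
    """Hypothetical transformer-based audio-visual fusion (illustrative only).

    Frame-level audio and visual features are concatenated along the feature
    dimension and encoded by a transformer; either stream can be masked
    (zeroed) so the module also handles audio-only or visual-only inputs.
    """

    def __init__(self, audio_dim=768, video_dim=768, d_model=768,
                 n_heads=8, n_layers=4):
        super().__init__()
        self.proj = nn.Linear(audio_dim + video_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, audio, video, mask_audio=False, mask_video=False):
        # audio, video: (batch, time, dim); zero a stream to drop that modality
        if mask_audio:
            audio = torch.zeros_like(audio)
        if mask_video:
            video = torch.zeros_like(video)
        fused = self.proj(torch.cat([audio, video], dim=-1))
        return self.encoder(fused)  # (batch, time, d_model) fused representation


# Usage: audio-only recognition by masking out the visual stream
fusion = AVFusionModule()
a = torch.randn(2, 100, 768)   # placeholder frame-level audio features
v = torch.randn(2, 100, 768)   # placeholder frame-level visual features
out = fusion(a, v, mask_video=True)
```

Zeroing the masked stream is just one simple way to realize modality dropout; the actual pre-training may use a different masking scheme.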