Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with visual information, which is invariant to acoustic noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup, so progress was hindered by the amount of available labeled data. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset, LRS3, our approach outperforms the prior state of the art by ~50% relative (28.0% vs. 14.1% WER) using less than 10% of the labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
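For clarity, and assuming the parenthesized figures are absolute word error rates (baseline vs. ours), the relative reductions quoted above work out as:
\[
\frac{28.0 - 14.1}{28.0} \approx 49.6\% \;\;\text{(vs. prior state of the art)}, \qquad
\frac{25.8 - 5.8}{25.8} \approx 77.5\% \;\;\text{(vs. the audio-based model)}.
\]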