Audio-visual speech recognition has attracted considerable attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets. However, accurate labelling of datasets is time-consuming and expensive. Hence, in this work, we investigate the use of automatically generated transcriptions of unlabelled datasets to increase the training set size. For this purpose, we use publicly available pre-trained ASR models to automatically transcribe unlabelled datasets such as AVSpeech and VoxCeleb2. Then, we train ASR, VSR, and AV-ASR models on the augmented training set, which consists of the LRS2 and LRS3 datasets as well as the additional automatically transcribed data. We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to a reduced word error rate (WER) despite using noisy transcriptions. The proposed model achieves new state-of-the-art AV-ASR performance on LRS2 and LRS3. In particular, it achieves a WER of 0.9% on LRS3, a relative improvement of 30% over the current state-of-the-art approach, and outperforms methods trained on non-publicly available datasets with 26 times more training data.
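To make the labelling pipeline concrete, below is a minimal sketch of the pseudo-labelling step: transcribing unlabelled audio with a publicly available pre-trained ASR model and writing one transcription file per clip. The choice of the openai-whisper package, the model size, and the directory layout are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of the pseudo-labelling step: transcribe unlabelled
# clips with a pre-trained ASR model, then use the outputs as automatic
# labels for training. Model choice (openai-whisper) and file layout are
# illustrative assumptions, not the paper's exact configuration.
from pathlib import Path

import whisper  # pip install openai-whisper


def pseudo_label(audio_dir: str, out_dir: str, model_name: str = "base") -> None:
    """Write one .txt transcription per .wav file found in audio_dir."""
    model = whisper.load_model(model_name)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        result = model.transcribe(str(wav))
        (out / f"{wav.stem}.txt").write_text(result["text"].strip())


if __name__ == "__main__":
    # e.g. generate automatic transcriptions for VoxCeleb2 audio tracks;
    # these paths are placeholders for the actual dataset location.
    pseudo_label("voxceleb2/audio", "voxceleb2/auto_labels")
```

The automatic transcriptions produced this way would then be pooled with the human-labelled LRS2 and LRS3 training sets to form the augmented training set described above.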