Visual speech recognition (VSR) aims to recognize the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development of much more accurate and robust VSR models than ever before. However, these advances are usually due to larger training sets rather than to model design. Here we demonstrate that designing better models is as important as using larger training sets. We propose the addition of prediction-based auxiliary tasks to a VSR model, and highlight the importance of hyperparameter optimization and appropriate data augmentations. We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin. It even outperforms models that were trained on non-publicly available datasets containing up to 21 times more data. We show, furthermore, that using additional training data, even in other languages or with automatically generated transcriptions, results in further improvement.
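The abstract mentions prediction-based auxiliary tasks added to the VSR model. The sketch below illustrates one plausible way to combine a main recognition loss with an auxiliary prediction loss on a shared visual encoder. It is a minimal sketch only: the module names, feature dimensions, choice of CTC as the main objective, the audio-feature regression target, and the loss weight are all illustrative assumptions, not the paper's exact architecture or training recipe.

```python
import torch
import torch.nn as nn


class VSRWithAuxiliaryTask(nn.Module):
    """Hypothetical VSR model with one prediction-based auxiliary head."""

    def __init__(self, feat_dim=256, vocab_size=40, aux_dim=80):
        super().__init__()
        # Stand-in temporal encoder over pre-extracted visual features.
        self.encoder = nn.GRU(input_size=feat_dim, hidden_size=feat_dim,
                              num_layers=2, batch_first=True)
        # Main head: frame-wise token logits for a CTC-style loss.
        self.recognition_head = nn.Linear(feat_dim, vocab_size)
        # Auxiliary head: predicts an audio-derived target (e.g. mel features)
        # from the same visual representation.
        self.aux_head = nn.Linear(feat_dim, aux_dim)

    def forward(self, visual_feats):
        encoded, _ = self.encoder(visual_feats)
        return self.recognition_head(encoded), self.aux_head(encoded)


def training_loss(model, visual_feats, targets, target_lens, aux_targets,
                  aux_weight=0.1):
    """Main CTC loss plus a weighted auxiliary regression loss (assumed weighting)."""
    logits, aux_pred = model(visual_feats)
    log_probs = logits.log_softmax(-1).transpose(0, 1)  # (T, B, V) as CTC expects
    input_lens = torch.full((visual_feats.size(0),), logits.size(1),
                            dtype=torch.long)
    main = nn.functional.ctc_loss(log_probs, targets, input_lens, target_lens)
    aux = nn.functional.mse_loss(aux_pred, aux_targets)
    return main + aux_weight * aux
```

The intent of such a setup is that the auxiliary target supplies extra supervision to the shared encoder during training, while only the recognition head is used at inference time.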