In this work, we propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, with the goal of utilizing audio data during lipreading model training. Audio and audio-visual systems have exhibited impressive progress in speech recognition. Nevertheless, much remains to be explored in visual speech recognition due to the visual ambiguity of some phonemes. To this end, the development of visual speech recognition models is crucial given the instability of audio models. The main contributions of this work are: i) building on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) into their systems; ii) leveraging audio data during the training of visual models, which has not been exploited in prior word-based work; iii) proposing Gaussian-shaped averaging in frame-level KD as an efficient technique that aids the model in distilling knowledge at the sequence model encoder. This work proposes a novel and competitive architecture for lipreading, and we demonstrate a noticeable improvement in performance, setting a new benchmark of 88.64% on the LRW dataset.
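To make the third contribution concrete, below is a minimal sketch of what a frame-level KD objective with Gaussian-shaped averaging might look like. The abstract does not spell out the exact formulation, so every name here (`frame_level_kd_loss`, `gaussian_kernel`, the `sigma` width, and the use of an MSE matching loss) is an illustrative assumption: each student (visual) encoder frame is regressed onto a Gaussian-weighted average of the teacher (audio) encoder frames around the same time step, rather than onto a single hard-aligned frame.

```python
import torch
import torch.nn.functional as F


def gaussian_kernel(length: int, center: int, sigma: float) -> torch.Tensor:
    """Normalized Gaussian weights over frame indices 0..length-1, peaked at `center`."""
    idx = torch.arange(length, dtype=torch.float32)
    w = torch.exp(-0.5 * ((idx - center) / sigma) ** 2)
    return w / w.sum()  # weights sum to 1, so the result is a weighted average


def frame_level_kd_loss(student_feats: torch.Tensor,
                        teacher_feats: torch.Tensor,
                        sigma: float = 1.0) -> torch.Tensor:
    """
    student_feats: (T, D) visual encoder outputs (student).
    teacher_feats: (T, D) audio encoder outputs (teacher).
    For each frame t, the distillation target is a Gaussian-shaped average of
    teacher frames centered at t, softening any audio-video misalignment.
    """
    T, _ = teacher_feats.shape
    targets = torch.stack([
        gaussian_kernel(T, t, sigma) @ teacher_feats  # (D,) soft target for frame t
        for t in range(T)
    ])
    # The teacher is frozen during distillation, hence the detach().
    return F.mse_loss(student_feats, targets.detach())
```

Under this reading, the Gaussian kernel is what makes the averaging "efficient": it replaces a one-to-one frame matching constraint with a smooth local average, so the visual encoder is not penalized for small temporal offsets relative to the audio encoder.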