Driven by deep learning techniques and large-scale datasets, automatic lip reading has undergone a paradigm shift in recent years. While the main thrust of Visual Speech Recognition (VSR) was originally to improve the accuracy of Audio Speech Recognition systems, other potential applications, such as biometric identification, and the promised gains of VSR systems have motivated extensive efforts to develop lip reading technology. This paper provides a comprehensive survey of state-of-the-art deep-learning-based VSR research, with a focus on data challenges, task-specific complications, and the corresponding solutions. Advancements in these directions will expedite the transformation of silent speech interfaces from theory to practice. We also discuss the main modules of a VSR pipeline and the influential datasets. Finally, we outline typical concerns and impediments in applying VSR to real-world scenarios, along with future research directions.