The lip region-of-interest (ROI) is conventionally used as the visual input in lipreading. Few works have adopted the entire face as visual input, because the parts of the face other than the lips are usually considered redundant and irrelevant to visual speech recognition. However, faces contain much more detailed information than lips, such as the speaker's head pose, emotion, and identity. We argue that such information might benefit visual speech recognition if a powerful feature extractor that takes the entire face as input is trained. In this work, we propose to adopt the entire face for lipreading with self-supervised learning. AV-HuBERT, an audio-visual multi-modal self-supervised learning framework, was adopted in our experiments. Our experimental results showed that adopting the entire face achieved a 16% relative word error rate (WER) reduction on the lipreading task, compared with the baseline method using the lip ROI as visual input. Without self-supervised pretraining, the model with face input achieved a higher WER than the model with lip input when training data was limited (30 hours), but a slightly lower WER when a large amount of training data (433 hours) was used.
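To make the input difference concrete, the following is a minimal sketch of how the two ROIs might be cropped from per-frame facial landmarks before being fed to the visual front-end. It assumes the common 68-point landmark convention (mouth points at indices 48-67); the crop sizes, margins, and function names are illustrative assumptions, not the paper's exact preprocessing.

```python
# Sketch: lip-ROI (baseline) vs. entire-face (proposed) visual input.
# Assumes 68-point facial landmarks (dlib/iBUG convention) as an Nx2 array
# of (x, y) pixel coordinates; sizes and margins are illustrative only.
import numpy as np
import cv2

def crop_roi(frame: np.ndarray, landmarks: np.ndarray,
             indices: slice, size: int, margin: float) -> np.ndarray:
    """Crop a square region centered on the selected landmarks, then resize."""
    pts = landmarks[indices]
    cx, cy = pts.mean(axis=0)
    # Half-width: largest landmark extent, expanded by a relative margin.
    half = (pts.max(axis=0) - pts.min(axis=0)).max() * (0.5 + margin)
    x0, y0 = int(cx - half), int(cy - half)
    x1, y1 = int(cx + half), int(cy + half)
    crop = frame[max(y0, 0):y1, max(x0, 0):x1]
    return cv2.resize(crop, (size, size))

def lip_roi(frame: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    # Conventional input: mouth landmarks only (points 48-67).
    return crop_roi(frame, landmarks, slice(48, 68), size=96, margin=0.3)

def face_roi(frame: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    # Proposed input: all 68 landmarks, i.e. the entire face, which retains
    # cues such as head pose, emotion, and identity.
    return crop_roi(frame, landmarks, slice(0, 68), size=128, margin=0.2)
```

In both cases the downstream model consumes the same kind of tensor (a sequence of resized grayscale or RGB crops); only the spatial extent of the crop changes, which is what allows the face-based variant to be trained and evaluated as a drop-in replacement for the lip-ROI baseline.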