Visual Speech Recognition (VSR) aims to transcribe speech into text by relying on lip movements alone. Since it models speech from visual information only, its performance is inherently sensitive to individual lip appearances and movements, and VSR models therefore degrade when applied to unseen speakers. In this paper, to remedy this performance degradation on unseen speakers, we propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive VSR. Specifically, motivated by recent advances in Natural Language Processing (NLP), we finetune prompts on the adaptation data of target speakers instead of modifying the pre-trained model parameters. Unlike previous prompt tuning methods, which are mainly limited to Transformer-variant architectures, we explore three types of prompts, in addition, padding, and concatenation form, that can be applied to a VSR model generally composed of a CNN and a Transformer. With the proposed prompt tuning, we show that the performance of a pre-trained VSR model on unseen speakers can be largely improved using a small amount of adaptation data (e.g., less than 5 minutes), even when the pre-trained model was already developed with large speaker variations. Moreover, by analyzing the performance and the number of parameters of the different prompt types, we investigate when prompt tuning is preferable to finetuning methods. The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases, LRW-ID and GRID.
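To make the three prompt forms concrete, the following is a minimal PyTorch sketch of how addition-, padding-, and concatenation-form prompts could attach to a frozen CNN + Transformer backbone. The toy backbone, tensor shapes, and module names (PromptedVSR, add_prompt, pad_prompt, cat_prompt) are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch: three prompt forms on a frozen CNN + Transformer VSR backbone.
# Backbone, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedVSR(nn.Module):
    def __init__(self, d_model=64, n_prompt_tokens=5, frame_size=24):
        super().__init__()
        # Frozen stand-ins for the pre-trained visual front-end and encoder.
        self.cnn = nn.Conv2d(1, d_model, kernel_size=3)  # padding applied manually below
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.parameters():
            p.requires_grad = False  # pre-trained backbone stays fixed

        # 1) Addition-form prompt: a learnable map added to every input frame.
        self.add_prompt = nn.Parameter(torch.zeros(1, 1, frame_size, frame_size))
        # 2) Padding-form prompt: learnable values in place of zero padding;
        #    a fixed mask keeps only the one-pixel border trainable.
        self.pad_prompt = nn.Parameter(torch.zeros(1, 1, frame_size + 2, frame_size + 2))
        mask = torch.ones(1, 1, frame_size + 2, frame_size + 2)
        mask[..., 1:-1, 1:-1] = 0
        self.register_buffer("pad_mask", mask)
        # 3) Concatenation-form prompt: tokens prepended to the frame sequence.
        self.cat_prompt = nn.Parameter(torch.zeros(1, n_prompt_tokens, d_model))

    def forward(self, frames):
        # frames: (batch, time, 1, H, W) grayscale lip crops.
        b, t = frames.shape[:2]
        x = frames.flatten(0, 1) + self.add_prompt                     # addition prompt
        x = F.pad(x, (1, 1, 1, 1)) + self.pad_prompt * self.pad_mask   # padding prompt
        feats = self.cnn(x).mean(dim=(2, 3)).view(b, t, -1)            # one feature per frame
        prompts = self.cat_prompt.expand(b, -1, -1)                    # concatenation prompt
        return self.encoder(torch.cat([prompts, feats], dim=1))

# Speaker adaptation optimizes only the prompts on a few minutes of data.
model = PromptedVSR()
tunable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(tunable, lr=1e-3)
out = model(torch.randn(2, 10, 1, 24, 24))
print(out.shape)  # (2, 5 + 10, 64): prompt tokens followed by frame features
```

Because the backbone parameters are frozen before the prompts are created, the optimizer receives only the prompt tensors, which is what keeps per-speaker adaptation lightweight.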