Lip reading aims to predict speech from lip movements alone. Because it relies solely on visual information to model speech, its performance is inherently sensitive to individual lip appearance and movement. As a result, lip reading models degrade when applied to unseen speakers, owing to the mismatch between training and testing conditions. Speaker adaptation techniques aim to reduce this mismatch between training and test speakers, guiding a trained model to focus on the speech content without being disturbed by speaker variations. In contrast to the decades of effort devoted to audio-based speech recognition, speaker adaptation methods have not been well studied in lip reading. In this paper, to remedy the performance degradation of lip reading models on unseen speakers, we propose a speaker-adaptive lip reading method, namely user-dependent padding. The user-dependent padding is a speaker-specific input that participates in the visual feature extraction stage of a pre-trained lip reading model. The lip appearance and movement information of each speaker can therefore be taken into account during visual feature encoding, adaptively for individual speakers. Moreover, the proposed method requires 1) no additional layers, 2) no modification of the learned weights of the pre-trained model, and 3) no speaker labels for the training data used during pre-training. It can adapt directly to unseen speakers by learning only the user-dependent padding, in a supervised or unsupervised manner. Finally, to alleviate the lack of speaker information in public lip reading databases, we label the speakers of a well-known audio-visual database, LRW, and design an unseen-speaker lip reading scenario named LRW-ID.
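To make the idea of adapting through the padding alone more concrete, the following is a minimal pure-Python sketch, not the paper's implementation. It assumes that the user-dependent padding fills the one-pixel border that a 3x3 convolution would otherwise pad with zeros; the abstract does not spell out this detail. The kernel values, the toy frame, the target feature map, and the helper names (`pad_with_user_padding`, `conv3x3`, `adapt_padding`) are all hypothetical, and a squared-error loss stands in for whatever supervised or unsupervised objective is actually used.

```python
from typing import List

Matrix = List[List[float]]


def pad_with_user_padding(frame: Matrix, padding: Matrix) -> Matrix:
    """Put `frame` inside `padding`; only the 1-pixel border of `padding` survives."""
    h, w = len(frame), len(frame[0])
    assert len(padding) == h + 2 and all(len(row) == w + 2 for row in padding)
    padded = [row[:] for row in padding]
    for i in range(h):
        for j in range(w):
            padded[i + 1][j + 1] = frame[i][j]
    return padded


def conv3x3(padded: Matrix, kernel: Matrix) -> Matrix:
    """3x3 cross-correlation (no extra padding) over an already padded input."""
    h, w = len(padded) - 2, len(padded[0]) - 2
    return [
        [
            sum(kernel[di][dj] * padded[i + di][j + dj]
                for di in range(3) for dj in range(3))
            for j in range(w)
        ]
        for i in range(h)
    ]


def loss(out: Matrix, target: Matrix) -> float:
    """Squared error; a stand-in for the real adaptation objective."""
    return sum((o - t) ** 2 for ro, rt in zip(out, target) for o, t in zip(ro, rt))


def adapt_padding(frame: Matrix, padding: Matrix, kernel: Matrix,
                  target: Matrix, lr: float = 0.05, steps: int = 100) -> Matrix:
    """Gradient descent on the border values only; the kernel is never updated."""
    h, w = len(frame), len(frame[0])
    padding = [row[:] for row in padding]
    for _ in range(steps):
        out = conv3x3(pad_with_user_padding(frame, padding), kernel)
        for p in range(h + 2):
            for q in range(w + 2):
                if 0 < p <= h and 0 < q <= w:
                    continue  # interior positions hold the frame, not padding
                grad = 0.0
                for di in range(3):
                    for dj in range(3):
                        i, j = p - di, q - dj  # output cell that reads padded[p][q]
                        if 0 <= i < h and 0 <= j < w:
                            grad += 2.0 * (out[i][j] - target[i][j]) * kernel[di][dj]
                padding[p][q] -= lr * grad
    return padding


# Frozen "pre-trained" kernel (hypothetical values).
KERNEL: Matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame: Matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
# Hypothetical feature map the adapted model should produce for this speaker.
target: Matrix = [[7.0, 8.0, 4.0], [10.0, 15.0, 9.0], [8.0, 13.0, 15.0]]

zero_padding: Matrix = [[0.0] * 5 for _ in range(5)]  # conventional zero padding
baseline = conv3x3(pad_with_user_padding(frame, zero_padding), KERNEL)

user_padding = adapt_padding(frame, zero_padding, KERNEL, target)
adapted = conv3x3(pad_with_user_padding(frame, user_padding), KERNEL)

loss_before = loss(baseline, target)
loss_after = loss(adapted, target)
```

In this toy setting the loss drops from `loss_before` to `loss_after` while `KERNEL` stays untouched, which mirrors the claim that adaptation needs neither new layers nor changes to the pre-trained weights. Only outputs whose receptive field touches the border can change; the centre output depends on the frame alone, so in a real network the padding of every convolutional layer would presumably need to be learned for the effect to reach deeper features.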