Lipreading refers to understanding and translating the speech of a speaker in a video into natural language. State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers who appear in both the training and inference sets. However, generalizing these methods to unseen speakers incurs catastrophic performance degradation due to the limited number of speakers in the training bank and the evident visual variations in lip shape and color across speakers. Therefore, relying solely on the visible changes of the lips tends to cause model overfitting. To address this problem, we propose to use multi-modal features across the visual and landmark modalities, which describe lip motion irrespective of speaker identity. We then develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer. Specifically, LipFormer consists of a lip motion stream, a facial landmark stream, and a cross-modal fusion module. The embeddings of the two streams are produced by self-attention and fed into a cross-attention module to align visual and landmark features. Finally, the fused features are decoded into text by a cascaded seq2seq model. Experiments demonstrate that our method effectively enhances model generalization to unseen speakers.
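To make the two-stream design concrete, the following is a minimal PyTorch sketch of a LipFormer-style model: two self-attention encoders over visual and landmark features, a cross-attention layer that aligns the streams, and an autoregressive decoder producing text logits. All module names, feature dimensions, the residual fusion, and the transformer decoder standing in for the cascaded seq2seq model are assumptions made for illustration, not the authors' exact implementation.

```python
# Illustrative sketch of a LipFormer-style two-stream lipreading model.
# Dimensions and layer choices are assumptions, not the paper's configuration.
import torch
import torch.nn as nn


class LipFormerSketch(nn.Module):
    def __init__(self, visual_dim=512, landmark_dim=136, d_model=256,
                 nhead=4, num_layers=2, vocab_size=40):
        super().__init__()
        # Project per-frame visual features (e.g., from a lip-region CNN) and
        # per-frame facial landmark coordinates into a shared embedding space.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.landmark_proj = nn.Linear(landmark_dim, d_model)

        # Self-attention encoders, one per stream.
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.landmark_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)

        # Cross-attention: visual embeddings attend to landmark embeddings.
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

        # Stand-in for the cascaded seq2seq decoder: an autoregressive
        # transformer decoder over character/word tokens.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.output_head = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, landmarks, tgt_tokens):
        # visual_feats: (B, T, visual_dim), landmarks: (B, T, landmark_dim)
        # tgt_tokens:   (B, L) token ids of the shifted target sentence
        v = self.visual_encoder(self.visual_proj(visual_feats))     # (B, T, d_model)
        lm = self.landmark_encoder(self.landmark_proj(landmarks))   # (B, T, d_model)

        # Align the streams: visual queries attend to landmark keys/values,
        # and the attended output is fused with the visual stream residually.
        aligned, _ = self.cross_attn(query=v, key=lm, value=lm)
        fused = v + aligned                                          # (B, T, d_model)

        # Decode the fused sequence into text logits with a causal mask.
        tgt = self.token_embed(tgt_tokens)                           # (B, L, d_model)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, fused, tgt_mask=causal_mask)
        return self.output_head(out)                                 # (B, L, vocab_size)


# Shape check with random tensors (batch of 2, 75 frames, 30 target tokens).
model = LipFormerSketch()
logits = model(torch.randn(2, 75, 512), torch.randn(2, 75, 136),
               torch.randint(0, 40, (2, 30)))
print(logits.shape)  # torch.Size([2, 30, 40])
```

Because the fused features depend on landmark geometry rather than lip appearance alone, a decoder trained on them is less tied to the identities seen during training, which is the intuition behind the generalization claim above.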