The challenge of talking face generation from speech lies in aligning two different modalities, audio and video, so that the mouth region corresponds to the input audio. Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models. However, they struggle to synthesize fine details of the lips that vary at the phoneme level because they do not provide sufficient visual information about the lips at the video synthesis step. To overcome this limitation, our work proposes Audio-Lip Memory, which brings in visual information of the mouth region corresponding to the input audio and enforces fine-grained audio-visual coherence. It stores lip motion features extracted from sequential ground-truth frames in a value memory and aligns them with the corresponding audio features so that they can be retrieved from audio input at inference time. Using the retrieved lip motion features as visual hints, the model can therefore easily correlate audio with visual dynamics during synthesis. By analyzing the memory, we show that each memory slot stores distinct lip features at the phoneme level, capturing subtle lip motion through memory addressing. In addition, we introduce a visual-visual synchronization loss that further improves lip-sync performance when used together with an audio-visual synchronization loss. Extensive experiments verify that our method generates high-quality video with mouth shapes that best align with the input audio, outperforming previous state-of-the-art methods.
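To make the memory mechanism concrete, below is a minimal, hypothetical PyTorch sketch of the key idea: a value memory holding lip-motion features that is addressed by an audio feature through softmax attention, together with a cosine-similarity synchronization term of the kind that could serve as the visual-visual alignment loss. All names, dimensions, and the exact loss form are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of an audio-addressed lip-motion memory (assumed shapes/names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioLipMemory(nn.Module):
    def __init__(self, num_slots=96, dim=512):
        super().__init__()
        # Key memory is addressed by audio features; value memory stores
        # lip-motion features distilled from ground-truth frames during training.
        self.key_memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.value_memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)

    def forward(self, audio_feat):
        # audio_feat: (B, dim) audio embedding for the current window.
        # Addressing weights over memory slots; per the paper's analysis,
        # individual slots tend to specialize to phoneme-level lip shapes.
        attn = F.softmax(audio_feat @ self.key_memory.t(), dim=-1)  # (B, num_slots)
        recalled_lip = attn @ self.value_memory                     # (B, dim)
        return recalled_lip, attn


def sync_loss(feat_a, feat_b):
    # Cosine-similarity agreement, usable for both the audio-visual term and the
    # visual-visual term (recalled vs. ground-truth lip feature) -- an assumption
    # about the loss form, used here only for illustration.
    return 1.0 - F.cosine_similarity(feat_a, feat_b, dim=-1).mean()


if __name__ == "__main__":
    memory = AudioLipMemory()
    audio_feat = torch.randn(4, 512)    # from an audio encoder (assumed)
    gt_lip_feat = torch.randn(4, 512)   # from a lip encoder on GT frames (assumed)
    recalled, attn = memory(audio_feat)
    loss = sync_loss(recalled, gt_lip_feat)  # visual-visual alignment term
    print(recalled.shape, attn.shape, loss.item())
```

At inference time only the audio path is needed: the memory is queried with the audio feature and the recalled lip-motion feature is passed to the synthesis network as a visual hint.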