Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions. However, most of them deform or generate the whole facial area, leading to unrealistic results. In this work, we delve into the formulation of altering only the mouth shapes of the target person. This requires masking a large percentage of the original image and seamlessly inpainting it with the aid of audio and reference frames. To this end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality by predicting the masked mouth shapes. Our key insight is to thoroughly exploit the contextual information provided by the audio and visual modalities with delicately designed Transformers. Specifically, we propose a convolution-Transformer hybrid backbone and design an attention-based fusion strategy for filling the masked parts. This strategy uniformly attends to the textural information of the unmasked regions and the reference frame. The semantic audio information is then incorporated to enhance the self-attention computation. Additionally, a refinement network with audio injection improves both image and lip-sync quality. Extensive experiments validate that our model can generate high-fidelity lip-synced results for arbitrary subjects.
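To make the attention-based fusion concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of how masked-mouth tokens could attend to unmasked-region and reference-frame tokens while audio features are injected into the attention context; all tensor shapes, module names, and the projection layer are assumptions for illustration only.

```python
# Hypothetical sketch of audio-aware attention fusion for masked inpainting.
# Shapes, names, and the audio-projection design are illustrative assumptions,
# not the actual AV-CAT implementation.
import torch
import torch.nn as nn


class AudioAwareFusion(nn.Module):
    """Masked-region tokens query unmasked + reference tokens,
    with projected audio features concatenated into keys/values."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_proj = nn.Linear(dim, dim)  # hypothetical audio projection
        self.norm = nn.LayerNorm(dim)

    def forward(self, masked_tok, context_tok, audio_feat):
        # masked_tok:  (B, Nm, C) tokens covering the masked mouth region
        # context_tok: (B, Nc, C) unmasked-region + reference-frame tokens
        # audio_feat:  (B, Na, C) encoded audio features for the clip
        ctx = torch.cat([context_tok, self.audio_proj(audio_feat)], dim=1)
        out, _ = self.attn(query=masked_tok, key=ctx, value=ctx)
        return self.norm(masked_tok + out)  # residual connection


# Usage with toy shapes
fusion = AudioAwareFusion(dim=256)
x = fusion(torch.randn(2, 64, 256),   # masked tokens
           torch.randn(2, 192, 256),  # unmasked + reference tokens
           torch.randn(2, 16, 256))   # audio features
print(x.shape)  # torch.Size([2, 64, 256])
```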