In this paper, we present StyleLipSync, a style-based personalized lip-sync video generative model that can generate identity-agnostic lip-synchronized videos from arbitrary audio. To generate videos of arbitrary identities, we leverage an expressive lip prior from the semantically rich latent space of a pre-trained StyleGAN, in which video consistency can also be enforced with a linear transformation. In contrast to previous lip-sync methods, we introduce pose-aware masking that dynamically locates the mask frame by frame using a 3D parametric mesh predictor, improving naturalness across frames. Moreover, we propose a few-shot lip-sync adaptation method for an arbitrary person by introducing a sync regularizer that preserves lip-sync generalization while enhancing person-specific visual information. Extensive experiments demonstrate that our model generates accurate lip-sync videos even in the zero-shot setting and, through the proposed adaptation method, enhances the characteristics of an unseen face using only a few seconds of target video. Please refer to our project page.
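To make the "video consistency with a linear transformation" idea concrete, below is a minimal sketch of one way a linear transform can smooth per-frame StyleGAN latent codes along the time axis. The moving-average formulation, the `smooth_latents` helper, and the latent dimensions are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (assumption): temporal consistency via a linear smoothing
# transform applied to per-frame StyleGAN latent codes.
import numpy as np

def smooth_latents(latents: np.ndarray, window: int = 3) -> np.ndarray:
    """Apply a linear moving-average transform along the time axis.

    latents: array of shape (T, D), one latent code per video frame.
    Returns an array of the same shape with temporally smoothed codes.
    """
    T, _ = latents.shape
    # Build the linear transform A (T x T): row t averages codes in a local window.
    A = np.zeros((T, T))
    for t in range(T):
        lo, hi = max(0, t - window // 2), min(T, t + window // 2 + 1)
        A[t, lo:hi] = 1.0 / (hi - lo)
    # Consistency expressed as a single linear transformation of the code sequence.
    return A @ latents

# Usage: 16 frames of hypothetical 512-dimensional latent codes.
codes = np.random.randn(16, 512)
smoothed = smooth_latents(codes)
print(smoothed.shape)  # (16, 512)
```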