Given one reference facial image and a speech clip as input, talking head generation aims to synthesize a realistic talking head video. However, generating a lip-synchronized video with natural head movements is challenging: the same speech clip can correspond to multiple plausible lip and head movements, i.e., there is no one-to-one mapping between speech and motion. To address this problem, we propose a Speech Feature Extractor (SFE) based on memory-augmented self-supervised contrastive learning, which introduces a memory module to store multiple possible speech mapping results. In addition, we introduce a Mixture Density Network (MDN) into the landmark regression task to generate multiple facial landmark predictions. Extensive qualitative and quantitative experiments show that the quality of our facial animation is significantly superior to that of state-of-the-art (SOTA) methods. The code has been released at https://github.com/Yaxinzhao97/MACL.git.
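To make the one-to-many landmark regression idea concrete, the following is a minimal PyTorch sketch of an MDN head, not the paper's implementation; the feature dimension (256), landmark count (68), number of mixture components (5), and the names MDNLandmarkHead and mdn_nll are illustrative assumptions. The head predicts mixture weights, means, and variances of a Gaussian mixture over landmark coordinates, and training minimizes the mixture negative log-likelihood, so a single speech feature can support several plausible landmark predictions.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNLandmarkHead(nn.Module):
    # Hypothetical sketch: maps a speech feature to a K-component Gaussian
    # mixture over 2D facial landmarks, so one audio clip can explain
    # several plausible landmark configurations.
    def __init__(self, feat_dim=256, n_landmarks=68, n_components=5):
        super().__init__()
        self.out_dim = n_landmarks * 2                               # (x, y) per landmark
        self.n_components = n_components
        self.pi = nn.Linear(feat_dim, n_components)                  # mixture weights (logits)
        self.mu = nn.Linear(feat_dim, n_components * self.out_dim)   # component means
        self.log_sigma = nn.Linear(feat_dim, n_components)           # isotropic log-std per component

    def forward(self, feat):
        b = feat.size(0)
        log_pi = F.log_softmax(self.pi(feat), dim=-1)                # (B, K)
        mu = self.mu(feat).view(b, self.n_components, self.out_dim)  # (B, K, D)
        log_sigma = self.log_sigma(feat)                             # (B, K)
        return log_pi, mu, log_sigma

def mdn_nll(log_pi, mu, log_sigma, target):
    # Negative log-likelihood of ground-truth landmarks under the mixture.
    target = target.unsqueeze(1)                                     # (B, 1, D)
    d = mu.size(-1)
    inv_var = torch.exp(-2.0 * log_sigma)                            # 1 / sigma^2, shape (B, K)
    sq_dist = ((target - mu) ** 2).sum(-1)                           # (B, K)
    log_prob = -0.5 * sq_dist * inv_var - d * log_sigma - 0.5 * d * math.log(2 * math.pi)
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()

# Usage: one speech feature yields K candidate landmark sets (the component means).
head = MDNLandmarkHead()
log_pi, mu, log_sigma = head(torch.randn(4, 256))
loss = mdn_nll(log_pi, mu, log_sigma, torch.randn(4, 68 * 2))

An isotropic variance per component keeps the head compact; at inference one could sample from the mixture or take the mean of the most probable component to obtain a concrete set of landmarks.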