In this work, we propose an ID-preserving talking head generation framework, which advances previous methods in two aspects. First, as opposed to interpolating from sparse flow, we claim that dense landmarks are crucial to achieving accurate geometry-aware flow fields. Second, inspired by face-swapping methods, we adaptively fuse the source identity during synthesis, so that the network better preserves the key characteristics of the image portrait. Although the proposed model surpasses prior methods in generation fidelity on established benchmarks, personalized fine-tuning is usually needed to make talking head generation ready for real-world use. However, this process is computationally demanding and unaffordable for standard users. To address this, we propose a fast adaptation model based on meta-learning: the learned model can be adapted into a high-quality personalized model in as little as 30 seconds. Last but not least, a spatial-temporal enhancement module is proposed to improve fine details while ensuring temporal coherence. Extensive experiments demonstrate the significant superiority of our approach over the state of the art in both one-shot and personalized settings.
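As a rough illustration of how such meta-learned fast adaptation can work, the sketch below shows a Reptile-style outer loop in PyTorch that trains the generator so a handful of gradient steps on one identity's footage yield a personalized model. The abstract does not specify the exact meta-learning algorithm; the model class, `reconstruction_loss` API, `sample_identity_batches` helper, and all hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical Reptile-style meta-training loop for fast personalization.
# All names and hyperparameters are assumptions for illustration only.
import copy
import torch

def reptile_meta_train(model, sample_identity_batches, meta_steps=10000,
                       inner_steps=5, inner_lr=1e-4, meta_lr=1e-3):
    """Meta-train `model` so that a few gradient steps on a single
    identity's clips yield a good personalized model (the property
    that enables adaptation on the order of seconds)."""
    for _ in range(meta_steps):
        # Clone the current meta-weights and fine-tune the clone on
        # data from one randomly sampled identity.
        fast_model = copy.deepcopy(model)
        opt = torch.optim.Adam(fast_model.parameters(), lr=inner_lr)
        for batch in sample_identity_batches(inner_steps):
            loss = fast_model.reconstruction_loss(batch)  # assumed loss API
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Reptile meta-update: nudge the meta-weights toward the
        # identity-adapted weights instead of backpropagating through
        # the inner loop (cheaper than second-order MAML).
        with torch.no_grad():
            for p, fp in zip(model.parameters(), fast_model.parameters()):
                p.add_(meta_lr * (fp - p))
    return model
```

At deployment time, personalization would reuse only the inner loop: a few optimizer steps on the target user's clip starting from the meta-learned weights, which is what makes adaptation in tens of seconds plausible.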