One-shot talking head generation produces lip-synced talking heads from arbitrary audio and a single source face. To make the results look natural and realistic, recent methods aim for free pose control rather than editing only the mouth region. However, existing methods fail to preserve the identity of the source face accurately when generating head motion. To solve this identity mismatch problem and achieve high-quality free pose control, we present the One-shot Pose-controllable Talking head generation network (OPT). Specifically, the Audio Feature Disentanglement Module separates content features from the audio, eliminating the influence of speaker-specific information contained in arbitrary driving audio. Next, the mouth expression feature is extracted from the content feature and the source face; at this stage, a landmark loss is designed to improve the accuracy of the facial structure and the quality of identity preservation. Finally, to achieve free pose control, controllable head pose features from reference videos are fed into the Video Generator along with the expression feature and the source face to generate new talking heads. Extensive quantitative and qualitative experiments verify that OPT generates high-quality pose-controllable talking heads without the identity mismatch problem, outperforming previous SOTA methods.
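The data flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `OPTPipelineSketch`, its four callable fields, and the function `landmark_loss` are hypothetical names introduced here, and the loss is assumed to be a mean Euclidean distance between corresponding 2D landmarks, since the abstract does not give its exact form.

```python
from dataclasses import dataclass
from math import hypot
from typing import Any, Callable, Sequence, Tuple

Point = Tuple[float, float]


def landmark_loss(predicted: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean Euclidean distance between paired 2D landmarks.

    Assumed form for illustration; the abstract only states that a
    landmark loss is used, not how it is computed.
    """
    if len(predicted) != len(target) or len(predicted) == 0:
        raise ValueError("landmark sets must be non-empty and of equal length")
    total = sum(hypot(px - tx, py - ty)
                for (px, py), (tx, ty) in zip(predicted, target))
    return total / len(predicted)


@dataclass
class OPTPipelineSketch:
    """Hypothetical wiring of the stages named in the abstract."""
    disentangle_audio: Callable[[Any], Any]        # audio -> content feature
    extract_expression: Callable[[Any, Any], Any]  # (content, source face) -> expression feature
    encode_pose: Callable[[Any], Any]              # reference video -> head pose feature
    generate_video: Callable[[Any, Any, Any], Any] # (source face, expression, pose) -> frames

    def run(self, audio: Any, source_face: Any, reference_video: Any) -> Any:
        # 1. Keep speech content, drop speaker-specific information.
        content = self.disentangle_audio(audio)
        # 2. Derive the mouth expression feature from content and source face.
        expression = self.extract_expression(content, source_face)
        # 3. Take controllable head pose features from the reference video.
        pose = self.encode_pose(reference_video)
        # 4. Generate the talking head from all three inputs.
        return self.generate_video(source_face, expression, pose)
```

In a real system each callable would be a trained network; here they are left as placeholders so that only the order of operations and the inputs to each stage are shown.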