We propose a neural talking-head video synthesis model and demonstrate its application to video conferencing. Our model learns to synthesize a talking-head video using a source image containing the target person's appearance and a driving video that dictates the motion in the output. The motion is encoded with a novel keypoint representation, in which identity-specific and motion-related information is decomposed in an unsupervised manner. Extensive experimental validation shows that our model outperforms competing methods on benchmark datasets. Moreover, our compact keypoint representation enables a video conferencing system that achieves the same visual quality as the commercial H.264 standard while using only one-tenth of the bandwidth. Furthermore, we show that our keypoint representation allows the user to rotate the head during synthesis, which is useful for simulating a face-to-face video conferencing experience.
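To give a rough sense of why transmitting keypoints instead of encoded frames can reduce bandwidth by an order of magnitude, the following back-of-the-envelope sketch compares the two streams. All numbers here (keypoint count, coordinate dimensionality, H.264 bitrate, frame rate) are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope bandwidth comparison: streaming raw keypoint
# coordinates per frame vs. an H.264-encoded video stream.
# All constants below are illustrative assumptions, not figures from the paper.

FPS = 30                     # assumed frame rate
H264_BITRATE_KBPS = 500      # assumed H.264 bitrate for a talking-head stream

NUM_KEYPOINTS = 20           # assumed number of learned keypoints
FLOATS_PER_KEYPOINT = 3      # assumed 3-D coordinates per keypoint
BYTES_PER_FLOAT = 4          # 32-bit floats, no further entropy coding


def keypoint_kbps(num_kp: int, floats_per_kp: int, fps: int) -> float:
    """Kilobits per second needed to stream raw keypoint coordinates."""
    bytes_per_frame = num_kp * floats_per_kp * BYTES_PER_FLOAT
    return bytes_per_frame * 8 * fps / 1000.0


kp_rate = keypoint_kbps(NUM_KEYPOINTS, FLOATS_PER_KEYPOINT, FPS)
print(f"keypoint stream : {kp_rate:.1f} kbps")
print(f"H.264 stream    : {H264_BITRATE_KBPS} kbps")
print(f"bandwidth ratio : {H264_BITRATE_KBPS / kp_rate:.1f}x")
```

Under these assumptions the keypoint stream needs only a few tens of kilobits per second, which is where the roughly one-tenth bandwidth figure comes from; in practice the sender would also transmit the source image once at call setup.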