Animating portraits using speech has received growing attention in recent years, with various creative and practical use cases. An ideal generated video should have good lip sync with the audio, natural facial expressions and head motions, and high frame quality. In this work, we present SPACE, which uses speech and a single image to generate high-resolution, expressive videos with realistic head pose, without requiring a driving video. It uses a multi-stage approach, combining the controllability of facial landmarks with the high-quality synthesis power of a pretrained face generator. SPACE also allows for the control of emotions and their intensities. Our method outperforms prior methods in objective metrics for image quality and facial motions and is strongly preferred by users in pairwise comparisons. The project website is available at https://deepimagination.cc/SPACE/