In this work, we address the task of unconditional head motion generation: animating still human faces in a low-dimensional semantic space from a single reference pose. Unlike traditional audio-conditioned talking-head generation, which seldom emphasizes realistic head motions, we devise a GAN-based architecture that learns to synthesize rich head motion sequences over long durations while keeping error accumulation low. In particular, the autoregressive generation of incremental outputs ensures smooth trajectories, while a multi-scale discriminator operating on input pairs drives the generator toward better handling of high- and low-frequency signals and reduced mode collapse. We experimentally demonstrate the relevance of the proposed method and show its superiority over models that attained state-of-the-art performance on similar tasks.
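The incremental, autoregressive generation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`generate_trajectory`, `toy_step`), the 6-DoF pose representation, and the stand-in random-delta "generator" are all assumptions made for demonstration; in the actual method a trained generator network would produce each increment.

```python
import numpy as np

def generate_trajectory(ref_pose, n_steps, step_fn, rng):
    """Autoregressively roll out a head-pose trajectory.

    At each step the generator predicts an incremental offset (delta)
    relative to the previous pose rather than an absolute pose; summing
    small deltas yields a smooth trajectory and limits the magnitude of
    per-step errors that can accumulate over long sequences.
    """
    poses = [np.asarray(ref_pose, dtype=float)]
    for _ in range(n_steps):
        delta = step_fn(poses[-1], rng)   # a trained network would go here
        poses.append(poses[-1] + delta)   # incremental (residual) update
    return np.stack(poses)

def toy_step(pose, rng):
    """Hypothetical stand-in for the generator: small random increments."""
    return rng.normal(scale=0.01, size=pose.shape)

rng = np.random.default_rng(0)
# 6-DoF pose (3 rotation + 3 translation parameters), 100 generated steps.
traj = generate_trajectory(np.zeros(6), 100, toy_step, rng)
```

Because each output is a small delta, consecutive poses stay close together, which is what makes the rolled-out trajectory smooth even over long durations.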