We propose a novel method for generating high-resolution videos of talking heads from speech audio and a single 'identity' image. Our method is based on a convolutional neural network that incorporates a pre-trained StyleGAN generator. We model each frame as a point in the latent space of StyleGAN, so that a video corresponds to a trajectory through the latent space. The network is trained in two stages. The first stage models trajectories in the latent space conditioned on speech utterances. To do this, we use an existing encoder to invert the generator, mapping each video frame into the latent space. We then train a recurrent neural network to map speech utterances to displacements in the latent space of the image generator. These displacements are relative to the latent-space back-projection of an identity image chosen from the individuals depicted in the training dataset. In the second stage, we improve the visual quality of the generated videos by fine-tuning the image generator on a single image or a short video of any chosen identity. We evaluate our model on standard measures (PSNR, SSIM, FID and LMD) and show that it significantly outperforms recent state-of-the-art methods on one of two commonly used datasets and gives comparable performance on the other. Finally, we report ablation experiments that validate the components of the model. Code and videos from our experiments are available at https://mohammedalghamdi.github.io/talking-heads-acm-mm
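The following is a minimal, illustrative PyTorch sketch of the first-stage idea described above: a recurrent network maps per-frame speech features to displacements in the generator's latent space, which are added to the back-projected latent of the identity image to form a per-frame latent trajectory. The class name, feature dimensions, and the assumption of an 18-layer W+ latent space are illustrative placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpeechToLatentRNN(nn.Module):
    """Maps per-frame speech features to displacements in a StyleGAN-style
    W+ latent space. Displacements are added to the back-projected latent of
    the identity image, giving one latent code per video frame (a trajectory).
    All sizes below are assumed for illustration only.
    """

    def __init__(self, audio_dim=80, hidden_dim=512, n_style_layers=18, w_dim=512):
        super().__init__()
        self.n_style_layers = n_style_layers
        self.w_dim = w_dim
        self.rnn = nn.LSTM(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_style_layers * w_dim)

    def forward(self, audio_feats, identity_latent):
        # audio_feats:     (B, T, audio_dim)              per-frame speech features
        # identity_latent: (B, n_style_layers, w_dim)     inverted identity image
        h, _ = self.rnn(audio_feats)                       # (B, T, hidden_dim)
        disp = self.head(h)                                # (B, T, n_style_layers * w_dim)
        disp = disp.view(*disp.shape[:2], self.n_style_layers, self.w_dim)
        # Each frame's latent = identity latent + predicted displacement.
        return identity_latent.unsqueeze(1) + disp         # (B, T, n_style_layers, w_dim)


# Usage sketch: each output latent would be decoded by a frozen, pre-trained
# StyleGAN generator (not shown) to produce one video frame.
model = SpeechToLatentRNN()
audio = torch.randn(2, 25, 80)          # 2 clips, 25 frames of speech features
w_id = torch.randn(2, 18, 512)          # back-projected identity latents
frame_latents = model(audio, w_id)      # (2, 25, 18, 512)
```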