In this paper, we present a dynamic convolution kernel (DCK) strategy for convolutional neural networks. Using a fully convolutional network with the proposed DCKs, high-quality talking-face video can be generated in real time from multi-modal sources (i.e., unmatched audio and video), and our trained model is robust to different identities, head postures, and input audio. Our proposed DCKs are specially designed for audio-driven talking-face video generation, leading to a simple yet effective end-to-end system. We also provide a theoretical analysis to interpret why DCKs work. Experimental results show that our method can generate high-quality talking-face video, including the background, at 60 fps. Comparisons and evaluations between our method and state-of-the-art methods demonstrate its superiority.
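To make the core idea concrete, the sketch below illustrates what an audio-conditioned dynamic convolution kernel could look like in the simplest case: a learned projection maps an audio feature vector to the weights of a convolution kernel, which is then applied to an image. This is a minimal illustrative sketch, not the authors' implementation; the function name `dynamic_kernel_conv`, the projection matrix `W`, and all shapes are assumptions for illustration only.

```python
import numpy as np

def dynamic_kernel_conv(image, audio_feat, W):
    """Illustrative sketch of a dynamic convolution kernel (DCK).

    A hypothetical learned projection `W` (shape k*k x d) maps a
    d-dimensional audio feature vector to the weights of a k x k
    convolution kernel, so the kernel changes with the audio input.
    The image is then convolved ('valid' mode) with that kernel.
    """
    k = int(np.sqrt(W.shape[0]))
    # Kernel weights are predicted from the audio features at run time,
    # rather than being fixed after training.
    kernel = (W @ audio_feat).reshape(k, k)
    H, Wd = image.shape
    out = np.zeros((H - k + 1, Wd - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out
```

Because the kernel is a function of the audio features, the same convolutional layer produces different responses for different audio inputs, which is the property that lets a single fully convolutional network handle unmatched audio and video.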