This paper investigates a novel task: generating talking-face videos solely from speech. Speech-to-video generation can spark interesting applications in the entertainment, customer service, and human-computer interaction industries. Indeed, the timbre, accent, and speed of speech can carry rich information about a speaker's appearance. The challenge lies mainly in disentangling the distinct visual attributes from the audio signal. In this article, we propose a lightweight, cross-modal distillation method that extracts disentangled emotional and identity information from unlabelled video inputs. The extracted features are then integrated by a generative adversarial network into talking-face video clips. With carefully crafted discriminators, the proposed framework achieves realistic generation results. Experiments with observed individuals demonstrate that the proposed framework captures emotional expressions solely from speech and produces spontaneous facial motion in the output videos. Compared to a baseline method in which speech is combined with a static image of the speaker, the results of the proposed framework are almost indistinguishable. User studies also show that the proposed method outperforms existing algorithms in terms of emotional expression in the generated videos.