Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to obtain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns from a style reference video and encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and the style code. To integrate the reference speaking style into the generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to this style-aware adaptation mechanism, the reference speaking style is better embedded into the synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
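To make the style-aware adaptation mechanism concrete, below is a minimal PyTorch sketch of one plausible realization: the style code produces a soft mixture over a small set of candidate feed-forward weights, which is one reading of "the encoded style code adjusts the weights of the feed-forward layers." The class name StyleAwareFFN, the router, and the number of candidates k are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StyleAwareFFN(nn.Module):
    """Hypothetical feed-forward layer whose weights are modulated by a style code.

    The style code is mapped to mixing coefficients over K candidate weight
    matrices; the mixed weights are then applied to the content features.
    """

    def __init__(self, d_model: int, d_hidden: int, style_dim: int, k: int = 4):
        super().__init__()
        # K candidate weight/bias sets for the two linear layers of the FFN.
        self.w1 = nn.Parameter(torch.randn(k, d_hidden, d_model) * 0.02)
        self.b1 = nn.Parameter(torch.zeros(k, d_hidden))
        self.w2 = nn.Parameter(torch.randn(k, d_model, d_hidden) * 0.02)
        self.b2 = nn.Parameter(torch.zeros(k, d_model))
        # Maps the style code to mixing coefficients over the K candidates.
        self.router = nn.Linear(style_dim, k)

    def forward(self, x: torch.Tensor, style_code: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); style_code: (batch, style_dim)
        alpha = F.softmax(self.router(style_code), dim=-1)        # (B, K)
        # Style-specific weights as a soft combination of the K candidates.
        w1 = torch.einsum("bk,khd->bhd", alpha, self.w1)          # (B, d_hidden, d_model)
        b1 = torch.einsum("bk,kh->bh", alpha, self.b1)            # (B, d_hidden)
        w2 = torch.einsum("bk,kdh->bdh", alpha, self.w2)          # (B, d_model, d_hidden)
        b2 = torch.einsum("bk,kd->bd", alpha, self.b2)            # (B, d_model)
        h = torch.relu(torch.einsum("bsd,bhd->bsh", x, w1) + b1.unsqueeze(1))
        return torch.einsum("bsh,bdh->bsd", h, w2) + b2.unsqueeze(1)


if __name__ == "__main__":
    # Usage sketch: modulate decoder content features with a reference style code.
    ffn = StyleAwareFFN(d_model=256, d_hidden=1024, style_dim=128)
    content = torch.randn(2, 50, 256)   # audio-driven content features
    style = torch.randn(2, 128)         # style code from the style encoder
    out = ffn(content, style)
    print(out.shape)                    # torch.Size([2, 50, 256])
```

In this sketch the adaptation is applied per sample, so different reference videos yield different effective feed-forward weights for the same speech content, which is the behavior the abstract attributes to the style-aware adaptive transformer.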