In this work, we propose a joint system that combines a talking face generation model with a text-to-speech model to generate multilingual talking face videos from text input alone. Our system synthesizes natural multilingual speech while preserving the vocal identity of the speaker, together with lip movements synchronized to the synthesized speech. We demonstrate the generalization capability of our system on four languages (Korean, English, Japanese, and Chinese), each from a different language family. We also compare the outputs of our talking face generation model with those of a prior work that claims multilingual support. For our demo, we add a translation API to the preprocessing stage and present the system as a neural dubber, so that users can exploit its multilingual capability more easily.