We propose an end-to-end lecture video generation system that generates realistic, complete lecture videos directly from annotated slides, an instructor's reference voice, and an instructor's reference portrait video. Our system is primarily composed of a speech synthesis module with few-shot speaker adaptation and an adversarial learning-based talking-head generation module. The system not only reduces instructors' workload but can also change the language and accent of the narration, helping students follow lectures more easily and enabling wider dissemination of lecture content. Our experimental results show that the proposed model outperforms current approaches in terms of authenticity, naturalness, and accuracy. A video demonstrating how our system works, along with the evaluation and comparison results, is available at https://youtu.be/cY6TYkI0cog.
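To make the two-module composition concrete, the following is a minimal sketch of how such a slide-to-video pipeline could be wired together. All class names, method signatures, and data types here (`AnnotatedSlide`, `SpeechSynthesizer`, `TalkingHeadGenerator`, `generate_lecture`) are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical pipeline sketch; names and signatures are assumptions,
# not the paper's implementation.
from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass
class AnnotatedSlide:
    image_path: str  # rendered slide image
    script: str      # narration text annotated on this slide


class SpeechSynthesizer(Protocol):
    def adapt(self, reference_voice_path: str) -> None:
        """Few-shot speaker adaptation from a short reference recording."""

    def synthesize(self, text: str) -> bytes:
        """Return a waveform (e.g. 16-bit PCM) for the given text."""


class TalkingHeadGenerator(Protocol):
    def generate(self, reference_portrait_path: str, audio: bytes) -> List[bytes]:
        """Return instructor video frames lip-synced to the audio."""


def generate_lecture(
    slides: List[AnnotatedSlide],
    tts: SpeechSynthesizer,
    head_gen: TalkingHeadGenerator,
    voice_ref: str,
    portrait_ref: str,
) -> List[Tuple[str, bytes, List[bytes]]]:
    """Compose the two modules into an end-to-end slide-to-video pipeline."""
    tts.adapt(voice_ref)  # personalize the synthetic voice once, up front
    segments = []
    for slide in slides:
        audio = tts.synthesize(slide.script)           # narration per slide
        frames = head_gen.generate(portrait_ref, audio)  # matching talking head
        segments.append((slide.image_path, audio, frames))
    return segments  # downstream: mux slides, audio, and frames into one video
```

One design consequence of this decoupling is that changing the lecture's language or accent only requires swapping the text fed to the speech synthesis module; the talking-head module simply re-syncs to the new audio.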