When we speak, the prosody and content of the speech can be inferred from the movement of our lips. In this work, we explore the task of lip-to-speech synthesis, i.e., learning to generate speech given only a speaker's lip movements, and we focus on learning accurate lip-to-speech mappings for multiple speakers in unconstrained, large-vocabulary settings. We capture a speaker's voice identity through their facial characteristics, i.e., age, gender, and ethnicity, and condition on these along with the lip movements to generate speaker-identity-aware speech. To this end, we present "Lip2Speech", a novel method with key design choices for achieving accurate lip-to-speech synthesis in unconstrained scenarios. We also perform extensive experiments and evaluation using quantitative and qualitative metrics as well as human evaluation.
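To make the conditioning idea concrete, below is a minimal PyTorch sketch: lip frames are encoded into per-timestep features, a still face crop is encoded into a speaker-identity embedding (a proxy for the age/gender/ethnicity cues mentioned above), and the decoder consumes both to predict a mel spectrogram. All module names, layer choices, and dimensions are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class Lip2SpeechSketch(nn.Module):
    """Hypothetical sketch of speaker-conditioned lip-to-speech synthesis."""

    def __init__(self, lip_dim=512, spk_dim=128, mel_bins=80):
        super().__init__()
        # 3D conv stack: encodes a window of lip frames, keeping the time axis.
        self.lip_encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool spatial dims only
        )
        self.lip_proj = nn.Linear(64, lip_dim)
        # Face encoder: maps one face image to a speaker-identity embedding.
        self.face_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, spk_dim),
        )
        # Decoder: lip features conditioned on the speaker embedding at every
        # time step, predicting one mel-spectrogram frame per step.
        self.decoder = nn.GRU(lip_dim + spk_dim, 256, batch_first=True)
        self.mel_head = nn.Linear(256, mel_bins)

    def forward(self, lip_frames, face_image):
        # lip_frames: (B, 3, T, H, W); face_image: (B, 3, H, W)
        lip = self.lip_encoder(lip_frames)                 # (B, 64, T, 1, 1)
        lip = lip.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        lip = self.lip_proj(lip)                           # (B, T, lip_dim)
        spk = self.face_encoder(face_image)                # (B, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, lip.size(1), -1)
        out, _ = self.decoder(torch.cat([lip, spk], dim=-1))
        return self.mel_head(out)                          # (B, T, mel_bins)


# Toy usage: a batch of 2 clips with 25 lip frames at 96x96 plus a face crop.
model = Lip2SpeechSketch()
mel = model(torch.randn(2, 3, 25, 96, 96), torch.randn(2, 3, 96, 96))
print(mel.shape)  # torch.Size([2, 25, 80])
```

Broadcasting the speaker embedding across all time steps is one simple way to inject identity; the synthesized speech then depends jointly on what the lips say and who appears to be saying it.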