The goal of this work is to reconstruct speech from a silent talking face video. Recent studies have shown impressive performance in synthesizing speech from silent talking face videos. However, they have not explicitly considered the varying identity characteristics of different speakers, which pose a challenge in video-to-speech synthesis and become even more critical in unseen-speaker settings. Distinct from previous methods, our approach separates the speech content and the visage style from a given silent talking face video. By guiding the model to focus on modeling the two representations independently, we can obtain highly intelligible speech from the model even when the input video of an unseen subject is given. To this end, we introduce a speech-visage selection module that separates the speech content and the speaker identity from the visual features of the input video. The disentangled representations are jointly incorporated to synthesize speech through a visage-style based synthesizer, which generates speech by coating the visage style onto the maintained speech content. The proposed framework thus has the advantage of synthesizing speech with the correct content even when given a silent talking face video of an unseen subject. We validate the effectiveness of the proposed framework on the GRID, TCD-TIMIT volunteer, and LRW datasets. The synthesized speech can be heard in the supplementary materials.
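The pipeline described above (select content and style from visual features, then synthesize by coating the style back onto the content) can be illustrated with a minimal, dependency-free toy sketch. All names are hypothetical, and the mean/residual split below is only a stand-in for the learned speech-visage selection module; it is not the paper's actual implementation.

```python
# Toy sketch of the two-branch idea from the abstract (illustrative only):
# a selection step splits per-frame visual features into a frame-varying
# speech-content stream and a single global visage-style vector, and a
# synthesizer recombines them. Here "style" is crudely approximated by the
# time-averaged feature and "content" by the per-frame residual.

def select_speech_and_visage(frame_features):
    """Split visual features into per-frame content and a global style."""
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    # Style: the time-averaged component, standing in for speaker identity.
    style = [sum(f[i] for f in frame_features) / num_frames for i in range(dim)]
    # Content: the frame-varying part (each frame minus the style vector).
    content = [[f[i] - style[i] for i in range(dim)] for f in frame_features]
    return content, style

def synthesize(content, style):
    """'Coat' the visage style onto the speech content, frame by frame."""
    return [[c + s for c, s in zip(frame, style)] for frame in content]

frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 frames, 2-dim features
content, style = select_speech_and_visage(frames)
speech = synthesize(content, style)  # reconstructs the original features
```

Because content and style are complementary here, recombining them recovers the input exactly; in the actual framework both branches are learned, and swapping in a different speaker's style while keeping the content is what enables unseen-speaker synthesis.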