The goal of this work is to reconstruct speech from a silent talking face video. Recent studies have shown impressive performance in synthesizing speech from silent talking face videos. However, they have not explicitly considered the varying identity characteristics of different speakers, which pose a challenge for video-to-speech synthesis and become even more critical in unseen-speaker settings. Our approach is to separate the speech content and the visage-style from a given silent talking face video. By guiding the model to focus on modeling the two representations independently, we can obtain speech of high intelligibility from the model even when the input video of an unseen subject is given. To this end, we introduce speech-visage selection, which separates the speech content and the speaker identity from the visual features of the input video. The disentangled representations are jointly incorporated to synthesize speech through a visage-style based synthesizer, which generates speech by coating the visage-styles onto the maintained speech content. Thus, the proposed framework has the advantage of synthesizing speech with the correct content even from the silent talking face video of an unseen subject. We validate the effectiveness of the proposed framework on the GRID, TCD-TIMIT volunteer, and LRW datasets.
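To make the described pipeline concrete, the following is a minimal, hypothetical sketch of the two components in PyTorch: a selection module that splits visual features into speech-content and visage-style streams, and a synthesizer that "coats" the content with the style before decoding a mel-spectrogram. The module names (SpeechVisageSelection, VisageStyleSynthesizer), the soft channel-wise selection, the AdaIN-like conditioning, and all tensor shapes are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class SpeechVisageSelection(nn.Module):
    """Splits visual features into speech-content and visage-style streams (illustrative)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Soft selection weights decide, per channel, whether a feature
        # carries linguistic content or speaker-identity (visage) style.
        self.selector = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, visual_feats: torch.Tensor):
        # visual_feats: (batch, time, dim) features from a video encoder.
        weights = self.selector(visual_feats)
        content = visual_feats * weights               # speech-content features
        visage_style = visual_feats * (1 - weights)    # speaker-identity features
        # Collapse time to get one style vector per input video.
        return content, visage_style.mean(dim=1)


class VisageStyleSynthesizer(nn.Module):
    """Generates a mel-spectrogram from content features conditioned on the style."""

    def __init__(self, dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.affine = nn.Linear(dim, 2 * dim)  # style -> (scale, shift)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, content: torch.Tensor, style: torch.Tensor):
        scale, shift = self.affine(style).chunk(2, dim=-1)
        # Modulate the content with the visage-style (AdaIN-like conditioning).
        x = content * scale.unsqueeze(1) + shift.unsqueeze(1)
        x, _ = self.decoder(x)
        return self.to_mel(x)  # (batch, time, n_mels)


if __name__ == "__main__":
    feats = torch.randn(2, 75, 256)  # e.g. 75 encoded video frames per clip
    content, style = SpeechVisageSelection()(feats)
    mel = VisageStyleSynthesizer()(content, style)
    print(mel.shape)  # torch.Size([2, 75, 80])
```

In this sketch, swapping the style vector between two videos while keeping the content stream fixed would illustrate the intended disentanglement: the synthesized speech keeps its content but takes on another speaker's visage-style.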