Recent studies have shown impressive performance in lip-to-speech synthesis, which aims to reconstruct speech from visual information alone. However, these methods struggle to synthesize accurate speech in the wild, owing to insufficient supervision for guiding the model to infer the correct content. Distinct from previous methods, in this paper we develop a powerful Lip2Speech method that can reconstruct speech with the correct content from input lip movements, even in a wild environment. To this end, we design a multi-task learning scheme that guides the model with multimodal supervision, i.e., text and audio, to complement the insufficient word representations provided by the acoustic feature reconstruction loss alone. The proposed framework therefore has the advantage of synthesizing speech with the correct content for multiple speakers uttering unconstrained sentences. We verify the effectiveness of the proposed method on the LRS2, LRS3, and LRW datasets.
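The multi-task idea can be illustrated with a minimal sketch: the total training objective combines an acoustic reconstruction term with a text-supervision term. The snippet below is an illustrative assumption, not the paper's exact design; it assumes a backbone that predicts mel-spectrograms and per-frame text logits from lip-movement features, and the module name, loss weight, and CTC-based text loss are hypothetical choices for demonstration.

```python
# Minimal sketch of multi-task supervision for Lip2Speech training.
# All names, shapes, and weights are illustrative assumptions.
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskLip2SpeechLoss(nn.Module):
    """Combines acoustic reconstruction loss with text (CTC) supervision."""

    def __init__(self, ctc_weight: float = 0.5, blank_id: int = 0):
        super().__init__()
        self.ctc_weight = ctc_weight  # hypothetical weighting between the two terms
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)

    def forward(self, pred_mel, target_mel, text_logits, text_targets,
                input_lengths, target_lengths):
        # Audio supervision: L1 reconstruction of the mel-spectrogram.
        recon_loss = F.l1_loss(pred_mel, target_mel)

        # Text supervision: CTC over per-frame text logits, guiding the
        # model toward the correct spoken content.
        log_probs = text_logits.log_softmax(dim=-1).transpose(0, 1)  # (T, B, C)
        ctc_loss = self.ctc(log_probs, text_targets,
                            input_lengths, target_lengths)

        return recon_loss + self.ctc_weight * ctc_loss
```

In this sketch, the reconstruction term supervises acoustic quality while the text term supplies content-level guidance, which is the complementary role the multimodal supervision plays in the proposed framework.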