Though significant progress has been made in speaker-dependent Video-to-Speech (VTS) synthesis, little attention has been devoted to multi-speaker VTS that can map silent video to speech while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC), where vector quantization with contrastive predictive coding (VQCPC) is used in the content encoder of VC to derive discrete phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network that infers the index sequence of acoustic units. The Lip2Ind network can then substitute for the content encoder of VC to form a multi-speaker VTS system that converts silent video to acoustic units for reconstructing accurate spoken content. The VTS system also inherits the advantages of VC by using a speaker encoder to produce speaker representations that effectively control the speaker identity of the generated speech. Extensive evaluations verify the effectiveness of the proposed approach, which can be applied in both constrained-vocabulary and open-vocabulary conditions, achieving state-of-the-art performance in generating high-quality speech with high naturalness, intelligibility, and speaker similarity. Our demo page is released here: https://wendison.github.io/VCVTS-demo/
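Since the Lip2Ind network simply substitutes for the VC content encoder at inference time, the resulting pipeline can be stated concretely. The following is a minimal PyTorch sketch of that inference path, not the authors' implementation: the module names (`Lip2Ind`, speaker encoder, decoder), the codebook shape, and the mel-spectrogram output target are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' code) of the VCVTS inference
# pipeline: silent video -> acoustic-unit indices -> decoder conditioned on a
# speaker embedding. Submodule architectures and tensor shapes are assumptions.
import torch
import torch.nn as nn


class VCVTSInference(nn.Module):
    def __init__(self, lip2ind: nn.Module, codebook: torch.Tensor,
                 speaker_encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.lip2ind = lip2ind                  # silent video -> logits over acoustic units
        self.register_buffer("codebook", codebook)  # (num_units, unit_dim), learned via VQCPC in the VC model
        self.speaker_encoder = speaker_encoder  # reference audio -> speaker embedding
        self.decoder = decoder                  # (unit sequence, speaker embedding) -> acoustic features

    @torch.no_grad()
    def forward(self, video: torch.Tensor, ref_audio: torch.Tensor) -> torch.Tensor:
        # Lip2Ind replaces the VC content encoder: it predicts the index
        # sequence of discrete acoustic units directly from the silent video.
        indices = self.lip2ind(video).argmax(dim=-1)   # (batch, time)
        units = self.codebook[indices]                 # look up phoneme-like unit vectors
        # The speaker encoder provides the identity control inherited from VC;
        # swapping ref_audio changes the voice of the generated speech.
        spk = self.speaker_encoder(ref_audio)
        return self.decoder(units, spk)                # e.g. mel-spectrogram for a vocoder
```

Because the content representation is a sequence of discrete unit indices shared with the VC model, speaker identity is controlled entirely through the reference audio fed to the speaker encoder, independent of the video input.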