The goal of this work is zero-shot text-to-speech (TTS) synthesis, with speaking styles and voices learnt from facial characteristics. Inspired by the natural fact that people can imagine someone's voice when they look at his or her face, we introduce a face-styled diffusion TTS model within a unified framework learnt from visible attributes, called Face-TTS. This is the first time that face images are used as a condition to train a TTS model. We jointly train cross-modal biometric and TTS models to preserve speaker identity between face images and generated speech segments. We also propose a speaker feature binding loss that enforces the similarity between the generated and the ground-truth speech segments in speaker embedding space. Since the biometric information is extracted directly from the face image, our method does not require extra fine-tuning steps to generate speech from unseen and unheard speakers. We train and evaluate the model on the LRS3 dataset, an in-the-wild audio-visual corpus containing background noise and diverse speaking styles. The project page is https://facetts.github.io.
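To make the role of the speaker feature binding loss concrete, the sketch below shows one plausible PyTorch formulation under stated assumptions: `spk_encoder`, `mel_generated`, and `mel_ground_truth` are hypothetical names for a pretrained speaker (biometric) encoder and the mel-spectrograms of the generated and reference speech; the paper's exact loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def speaker_feature_binding_loss(spk_encoder, mel_generated, mel_ground_truth):
    """Sketch of a speaker feature binding loss: pull the generated and
    ground-truth speech segments together in speaker-embedding space.

    spk_encoder: hypothetical pretrained speaker/biometric encoder mapping a
        mel-spectrogram of shape [B, n_mels, T] to an embedding of shape [B, D].
    """
    emb_gen = spk_encoder(mel_generated)              # [B, D]
    emb_gt = spk_encoder(mel_ground_truth).detach()   # targets carry no gradient
    # Maximise cosine similarity between the two embeddings,
    # i.e. minimise 1 - cos_sim, averaged over the batch.
    cos_sim = F.cosine_similarity(emb_gen, emb_gt, dim=-1)  # [B]
    return (1.0 - cos_sim).mean()
```

Keeping the ground-truth embedding detached in this sketch means the loss only shapes the TTS generator toward the reference speaker identity, rather than letting the speaker encoder drift.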