In this paper, we propose an effective method for synthesizing speaker-specific speech waveforms by conditioning on videos of an individual's face. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and can be controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary with the input face images. Our method can therefore be regarded as a multi-speaker face-to-speech waveform model. We demonstrate the superiority of the proposed model over conventional methods through both objective and subjective evaluations. Specifically, we evaluate the performance of the linguistic feature and speaker characteristic generation modules by measuring the accuracy of automatic speech recognition and automatic speaker/gender recognition, respectively. We also assess the naturalness of the synthesized speech waveforms with a mean opinion score (MOS) test.
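To make the conditioning scheme concrete, the following is a minimal PyTorch sketch of a waveform generator that fuses a per-frame linguistic feature sequence with a global speaker embedding, in the spirit of the model described above. The class name, feature dimensions, layer sizes, and upsampling factors are all illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a conditional waveform generator (assumed architecture).
# Hypothetical names and dimensions; the paper's actual design may differ.
import torch
import torch.nn as nn

class FaceToSpeechGenerator(nn.Module):
    """Maps per-frame linguistic features plus a speaker embedding to a waveform."""

    def __init__(self, linguistic_dim=256, speaker_dim=128, hidden_dim=512):
        super().__init__()
        # Fuse the two independent conditions along the channel axis.
        self.pre = nn.Conv1d(linguistic_dim + speaker_dim, hidden_dim,
                             kernel_size=7, padding=3)
        # Transposed convolutions upsample from video frame rate to audio
        # sample rate (e.g., 25 fps -> 16 kHz needs a factor of 640 = 8*8*10).
        self.upsample = nn.Sequential(
            nn.ConvTranspose1d(hidden_dim, hidden_dim // 2,
                               kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(hidden_dim // 2, hidden_dim // 4,
                               kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(hidden_dim // 4, hidden_dim // 8,
                               kernel_size=20, stride=10, padding=5),
            nn.LeakyReLU(0.2),
        )
        self.post = nn.Conv1d(hidden_dim // 8, 1, kernel_size=7, padding=3)

    def forward(self, linguistic, speaker):
        # linguistic: (batch, linguistic_dim, T_frames) from a lip-reading model
        # speaker:    (batch, speaker_dim) predicted from face images
        speaker = speaker.unsqueeze(-1).expand(-1, -1, linguistic.size(-1))
        x = self.pre(torch.cat([linguistic, speaker], dim=1))
        x = self.upsample(x)
        return torch.tanh(self.post(x))  # waveform samples in [-1, 1]

if __name__ == "__main__":
    gen = FaceToSpeechGenerator()
    # 1 s of 25 fps video features -> (1, 1, 16000) waveform at 16 kHz.
    wav = gen(torch.randn(1, 256, 25), torch.randn(1, 128))
    print(wav.shape)
```

Because the speaker embedding enters only as a broadcast channel-wise condition, it can be swapped independently of the linguistic sequence, which is what allows the model to vary speaker characteristics with the input face while keeping the spoken content fixed.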