In this paper, we propose a multi-speaker face-to-speech waveform generation model that also generalizes to unseen speaker conditions. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and can be controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary depending on the input face images. We show the superiority of our proposed model over conventional methods in terms of objective and subjective evaluation results. Specifically, we evaluate the performance of the linguistic features by measuring recognition accuracy on an automatic speech recognition task. In addition, we estimate speaker and gender similarity for the multi-speaker and unseen conditions, respectively. We also evaluate the naturalness of the synthesized speech waveforms using a mean opinion score (MOS) test and non-intrusive objective speech quality assessment (NISQA). Demo samples of the proposed and other models are available at https://sam-0927.github.io/
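To make the conditioning scheme concrete, the following is a minimal sketch (not the authors' implementation) of a waveform generator driven by two independent streams: frame-level linguistic features from a lip-reading model and a fixed-length speaker embedding predicted from the face image. All layer choices, dimensions, and upsampling rates here are illustrative assumptions.

```python
# Hypothetical sketch of a two-stream conditional waveform generator.
# Assumes 512-dim frame-level linguistic features and a 256-dim speaker
# embedding; upsampling factors (8 * 8 * 4 = 256 samples per frame) are
# illustrative, not taken from the paper.
import torch
import torch.nn as nn

class ConditionalWaveGenerator(nn.Module):
    def __init__(self, ling_dim=512, spk_dim=256, channels=256):
        super().__init__()
        # Project each condition stream separately so they remain
        # independently controllable, as described in the abstract.
        self.ling_proj = nn.Conv1d(ling_dim, channels, kernel_size=1)
        self.spk_proj = nn.Linear(spk_dim, channels)
        # Transposed convolutions upsample frame-rate features to sample rate.
        self.upsample = nn.Sequential(
            nn.ConvTranspose1d(channels, channels // 2, 16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(channels // 2, channels // 4, 16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(channels // 4, channels // 8, 8, stride=4, padding=2),
            nn.LeakyReLU(0.1),
        )
        self.to_wave = nn.Conv1d(channels // 8, 1, kernel_size=7, padding=3)

    def forward(self, ling_feats, spk_emb):
        # ling_feats: (batch, ling_dim, frames); spk_emb: (batch, spk_dim)
        h = self.ling_proj(ling_feats)
        # Broadcast the speaker embedding over time and add it as a bias.
        h = h + self.spk_proj(spk_emb).unsqueeze(-1)
        h = self.upsample(h)
        return torch.tanh(self.to_wave(h))  # waveform in [-1, 1]

gen = ConditionalWaveGenerator()
wave = gen(torch.randn(1, 512, 50), torch.randn(1, 256))  # 50 frames -> 12800 samples
```

Because the two streams are combined only after separate projections, swapping the speaker embedding while holding the linguistic features fixed changes speaker identity without altering the spoken content, mirroring the independent control the abstract describes.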