Facetron: A Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations

In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speaker conditions. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary depending on the input face images. We show the superiority of our proposed model over conventional methods in terms of objective and subjective evaluation results. Specifically, we evaluate the performance of the linguistic features by measuring their accuracy on an automatic speech recognition task. In addition, we estimate speaker and gender similarity for multi-speaker and unseen conditions, respectively. We also evaluate the naturalness of the synthesized speech waveforms using a mean opinion score (MOS) test and non-intrusive objective speech quality assessment (NISQA). The demo samples of the proposed and other models are available at https://sam-0927.github.io/
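The independence of the two feature streams can be illustrated with a toy sketch. The abstract does not say how the linguistic and speaker features are combined before they condition the generator, so the frame-wise concatenation below is an assumption, and `build_conditions` and all values are hypothetical names and data used only for illustration.

```python
from typing import List


def build_conditions(
    linguistic_frames: List[List[float]],
    speaker_embedding: List[float],
) -> List[List[float]]:
    """Append one utterance-level speaker embedding to every
    frame-level linguistic feature vector (assumed combination rule)."""
    return [frame + speaker_embedding for frame in linguistic_frames]


# Hypothetical toy features: 3 video frames, 2-dim linguistic features.
linguistic = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Hypothetical speaker embeddings predicted from two different faces.
speaker_a = [1.0, 0.0]
speaker_b = [0.0, 1.0]

# Same lip movements, different faces: only the speaker part changes.
cond_a = build_conditions(linguistic, speaker_a)
cond_b = build_conditions(linguistic, speaker_b)
```

Swapping the speaker embedding leaves the linguistic part of every frame untouched, which is the property that would let the speaker characteristics vary with the input face while the spoken content stays fixed.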

