Whispered and Lombard Neural Speech Synthesis

It is desirable for a text-to-speech system to take into account the environment in which synthetic speech is presented and to provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) pre-training and fine-tuning a model for each style; 2) Lombard and whisper speech conversion through a signal-processing-based approach; 3) multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that 1) we can generate high-quality speech through the pre-training/fine-tuning approach for all speaking styles, and 2) although our speaker verification (SV) model is not explicitly trained to discriminate between different speaking styles, and no Lombard or whisper voice is used for pre-training this system, the SV model can be used as a style encoder for generating different style embeddings as input for the Tacotron system. We also show that the resulting synthetic Lombard speech has a significant positive impact on intelligibility.
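The abstract does not say how the SV-derived style embedding is passed to Tacotron. As a rough illustration only, the sketch below assumes two common choices that are not confirmed by the source: a style vector obtained by averaging L2-normalized utterance embeddings, and conditioning by concatenating that vector to every encoder frame. The function names `style_embedding` and `condition_encoder_outputs`, and the toy vectors, are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def style_embedding(utterance_embeddings):
    """Average L2-normalized utterance embeddings into one unit-length style vector.

    Assumption: each input is an embedding produced by a speaker verification
    model for one utterance of the target style (e.g. Lombard).
    """
    normed = [l2_normalize(e) for e in utterance_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)


def condition_encoder_outputs(encoder_outputs, style_vec):
    """Append the same style vector to every encoder frame (assumed conditioning)."""
    return [list(frame) + list(style_vec) for frame in encoder_outputs]


# Toy data standing in for SV embeddings of two Lombard utterances.
lombard_utterances = [[3.0, 4.0], [6.0, 8.0]]
style = style_embedding(lombard_utterances)

# Toy data standing in for two text-encoder frames of dimension 3.
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
conditioned = condition_encoder_outputs(frames, style)
```

With this layout, each conditioned frame has the encoder dimension plus the embedding dimension, so switching style at synthesis time would amount to swapping the style vector while the text input stays the same.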

