A text-to-speech system should take into account the environment in which synthetic speech is presented and provide appropriate, context-dependent output to the user. In this paper, we present and compare several approaches for generating different speaking styles, namely normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) pre-training and fine-tuning a model for each style; 2) Lombard and whisper speech conversion through a signal-processing-based approach; 3) multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that 1) the pre-training/fine-tuning approach generates high-quality speech for all speaking styles, and 2) although our speaker verification (SV) model is not explicitly trained to discriminate between speaking styles, and no Lombard or whisper speech is used to pre-train this system, the SV model can be used as a style encoder that generates style embeddings as input to the Tacotron system. We also show that the resulting synthetic Lombard speech yields a significant intelligibility gain.
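To make the third system more concrete, the following is a minimal sketch of one common way to condition a Tacotron-style synthesizer on a fixed-size embedding: average the SV embeddings of a few reference utterances per style, then concatenate the result to every encoder time step. The abstract does not specify the conditioning mechanism, so the concatenation, the embedding and encoder dimensions (256 and 512), and the function names `style_embedding` and `condition_encoder_outputs` are assumptions for illustration only; random arrays stand in for real SV-model outputs.

```python
import numpy as np


def style_embedding(utterance_embeddings):
    """Average per-utterance SV embeddings and L2-normalise the result."""
    mean = np.mean(utterance_embeddings, axis=0)
    return mean / np.linalg.norm(mean)


def condition_encoder_outputs(encoder_outputs, embedding):
    """Concatenate one style embedding to every encoder time step."""
    tiled = np.tile(embedding, (encoder_outputs.shape[0], 1))
    return np.concatenate([encoder_outputs, tiled], axis=1)


rng = np.random.default_rng(0)

# Assumption: random stand-ins for SV-model outputs,
# 5 reference utterances per style, 256 dimensions each.
sv_outputs = {
    style: rng.normal(size=(5, 256))
    for style in ("normal", "lombard", "whisper")
}
style_table = {s: style_embedding(e) for s, e in sv_outputs.items()}

# Assumption: 40 text positions with 512-dimensional encoder states.
encoder_outputs = rng.normal(size=(40, 512))
conditioned = condition_encoder_outputs(encoder_outputs, style_table["lombard"])
```

Under these assumptions, switching the speaking style at synthesis time amounts to selecting a different entry of `style_table`; the synthesizer weights stay the same.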