An objective evaluation of the effects of recording conditions and speaker characteristics in multi-speaker deep neural speech synthesis

Multi-speaker spoken datasets enable the creation of text-to-speech synthesis (TTS) systems which can output several voice identities. The multi-speaker (MSPK) scenario also enables the use of fewer training samples per speaker. However, in the resulting acoustic model, not all speakers exhibit the same synthetic quality, and some of the voice identities cannot be used at all. In this paper we evaluate the influence of the recording conditions, speaker gender, and speaker particularities on the quality of the synthesised output of a deep neural TTS architecture, namely Tacotron2. The evaluation is made possible by a large Romanian parallel spoken corpus containing over 81 hours of data. Within this setup, we also evaluate the influence of different types of text representations: orthographic, phonetic, and phonetic extended with syllable boundaries and lexical stress markings. We evaluate the results of the MSPK system using the objective measures of equal error rate (EER) and word error rate (WER), and also look into the distances between the t-SNE projections of natural and synthesised speaker embeddings computed by an accurate speaker verification network. The results show that there is indeed a strong correlation between the recording conditions and the speaker's synthetic voice quality. They also show that speaker gender does not influence the output, and that extending the input text representation with syllable boundaries and lexical stress information does not equally enhance the generated audio across all speaker identities. The visualisation of the t-SNE projections of the natural and synthesised speaker embeddings shows that the acoustic model shifts the neural representation of some speakers, but not all of them. As a result, the output speech for these speakers is of lower quality.
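To make the WER measure concrete, the following is a minimal sketch of how a word error rate can be computed from a reference transcript and a recogniser's hypothesis. It is not taken from the paper: the function name `word_error_rate`, the whitespace tokenisation, and the example sentences are assumptions made for illustration, and the paper's own speech recogniser and text normalisation are not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenisation is plain
    whitespace splitting, which is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

In a TTS evaluation of this kind, the hypothesis would typically be the output of a speech recogniser run on the synthesised audio, so a lower WER suggests more intelligible speech. Note that the value can exceed 1.0 when the hypothesis contains many inserted words.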


Related content

Speech synthesis


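Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on many disciplines, including artificial intelligence, psychology, acoustics, linguistics, digital signal processing, and computer science, and is a frontier technology in the field of information processing. As computer technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesised speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks, hospitals, and similar settings, in car navigation systems, and in automated call centres, and it has yielded substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, its applications are gradually extending into entertainment, language teaching, rehabilitation therapy, and other fields. Speech synthesis can be said to be influencing every aspect of people's lives.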