Multi-speaker spoken datasets enable the creation of text-to-speech synthesis (TTS) systems which can output several voice identities. The multi-speaker (MSPK) scenario also enables the use of fewer training samples per speaker. However, in the resulting acoustic model, not all speakers exhibit the same synthetic quality, and some of the voice identities cannot be used at all. In this paper we evaluate the influence of the recording conditions, speaker gender, and speaker particularities on the quality of the synthesised output of a deep neural TTS architecture, namely Tacotron2. The evaluation is made possible by a large Romanian parallel spoken corpus containing over 81 hours of data. Within this setup, we also evaluate the influence of different types of text representations: orthographic, phonetic, and phonetic extended with syllable boundaries and lexical stress markings. We evaluate the results of the MSPK system using the objective measures of equal error rate (EER) and word error rate (WER), and also look into the distances between the t-SNE projections of natural and synthesised speaker embeddings computed by an accurate speaker verification network. The results show that there is indeed a strong correlation between the recording conditions and the speaker's synthetic voice quality. The speaker gender does not influence the output, and extending the input text representation with syllable boundaries and lexical stress information does not equally enhance the generated audio across all speaker identities. The visualisation of the t-SNE projections of the natural and synthesised speaker embeddings shows that the acoustic model shifts the neural representations of some speakers, but not all of them. As a result, these speakers have lower-quality output speech.