Training multilingual Neural Text-To-Speech (NTTS) models using only monolingual corpora has emerged as a popular way of building voice-cloning-based Polyglot NTTS systems. To train these models, it is essential to understand how the composition of the training corpora affects the quality of multilingual speech synthesis. In this context, it is common to hear questions such as "Would including more Spanish data help my Italian synthesis, given the closeness of both languages?". Unfortunately, we found that the existing literature does not fully answer such questions. In the present work, we conduct an extensive ablation study aimed at understanding how various properties of the training corpora, such as language family affiliation, gender composition, and the number of speakers, contribute to the quality of Polyglot synthesis. Our findings include the observation that female speaker data are preferred in most scenarios, and that it is not always beneficial to have more speakers from the target language variant in the training corpus. These findings can inform data procurement and corpus building.