The International Phonetic Alphabet (IPA) has been widely used in cross-lingual text-to-speech (TTS) to achieve cross-lingual voice cloning (CL VC). However, the role of IPA itself in cross-lingual TTS has been understudied. In this paper, we report empirical findings from building a cross-lingual TTS model that takes IPA as input. Experiments show that how the IPA and suprasegmental sequences are processed has a negligible impact on CL VC performance. Furthermore, we find that an IPA-based TTS system trained on a dataset with only one speaker per language fails at CL VC, since the language-unique IPA and tone/stress symbols can leak speaker information. In addition, we experiment with different combinations of speakers in the training dataset to further investigate how the number of speakers affects CL VC performance.
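The leakage argument can be illustrated with a minimal sketch. It is not the paper's method or data: the two symbol inventories below are small hand-picked subsets chosen only for illustration, and the helper names `language_unique_symbols` and `identifying_languages` are hypothetical. The sketch shows that when each language has a single speaker, any symbol occurring in only one language also identifies that language's speaker.

```python
from collections import Counter

# Toy, hand-picked symbol subsets for illustration only (an assumption,
# not the inventories or data used in the paper).
INVENTORIES = {
    "en": {"p", "t", "k", "s", "m", "n", "θ", "ð", "ˈ"},       # ˈ is a stress mark
    "zh": {"p", "t", "k", "s", "m", "n", "ɕ", "ʂ", "˥", "˩"},  # ˥ and ˩ are tone letters
}


def language_unique_symbols(inventories):
    """Return, per language, the symbols that appear in no other language."""
    counts = Counter(sym for inv in inventories.values() for sym in inv)
    return {
        lang: {sym for sym in inv if counts[sym] == 1}
        for lang, inv in inventories.items()
    }


def identifying_languages(sequence, unique):
    """Return the languages that a symbol sequence gives away.

    With one speaker per language in the training set, a language given
    away by its unique symbols also gives away the speaker.
    """
    present = set(sequence)
    return {lang for lang, syms in unique.items() if syms & present}


unique = language_unique_symbols(INVENTORIES)
```

In this toy setup, a sequence containing a stress mark or a tone letter maps to exactly one language, while a sequence built only from shared symbols maps to none. Under the one-speaker-per-language condition, a model can therefore learn to associate such symbols with a speaker identity rather than relying on the speaker embedding alone.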