In this paper, we study the disentanglement of speaker and language representations in non-autoregressive cross-lingual TTS models from several perspectives. We propose a phoneme length regulator that resolves the length mismatch between the IPA input sequence and monolingual alignment results. Using the phoneme length regulator, we present a FastPitch-based cross-lingual model that takes IPA symbols as input representations. Our experiments show that language-independent input representations (e.g., IPA symbols), a larger number of training speakers, and explicit modeling of speech variance information all encourage non-autoregressive cross-lingual TTS models to disentangle speaker and language representations. Subjective evaluation shows that our proposed model can achieve decent naturalness and speaker similarity in cross-lingual voice cloning.