In most cases, bilingual TTS needs to handle three types of input scripts: first language only, second language only, and second language embedded in the first language. In the latter two situations, the pronunciation and intonation of the second language are usually quite different because of the influence of the first language. It is therefore challenging to model the pronunciation and intonation of the second language accurately in different contexts without mutual interference. This paper builds a Mandarin-English TTS system that produces more standard spoken English from a monolingual Chinese speaker. We introduce a phonology embedding to capture how English differs across phonologies. An embedding mask is applied to the language embedding to distinguish information between languages, and to the phonology embedding to focus it on English expression. We also design an embedding strength modulator to capture the dynamic strength of language and phonology. Experiments show that our approach produces significantly more natural and standard spoken English for the monolingual Chinese speaker. Our analysis finds that suitable phonology control contributes to better performance in different scenarios.
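To make the roles of the embedding mask and the strength modulator concrete, the following is a minimal sketch and not the system described above. Every name in it is hypothetical, and it assumes that the phonology mask is a binary per-token indicator of English, that the language embedding is selected by a per-token language ID, and that the modulator is a sigmoid of a linear projection of the token encoding. The abstract specifies none of these details.

```python
import math

ZH, EN = 0, 1  # hypothetical language IDs: Mandarin, English


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def condition_tokens(encodings, lang_ids, phon_id,
                     lang_table, phon_table, w_lang, w_phon):
    """Add masked, strength-scaled embeddings to each token encoding.

    encodings: per-token text-encoder outputs (lists of floats)
    lang_ids: per-token language ID (ZH or EN)
    phon_id: utterance-level phonology ID (a row of phon_table)
    w_lang, w_phon: hypothetical modulator weights, one per dimension
    """
    conditioned = []
    for enc, lang in zip(encodings, lang_ids):
        # Assumed phonology mask: 1 for English tokens, 0 for Mandarin tokens.
        phon_mask = 1.0 if lang == EN else 0.0
        # Assumed modulator: one scalar strength in (0, 1) per token and embedding.
        s_lang = sigmoid(dot(w_lang, enc))
        s_phon = sigmoid(dot(w_phon, enc))
        lang_vec = lang_table[lang]    # row selected by the token's language
        phon_vec = phon_table[phon_id]
        conditioned.append([
            x + s_lang * l + phon_mask * s_phon * p
            for x, l, p in zip(enc, lang_vec, phon_vec)
        ])
    return conditioned


# Toy code-switched input: Mandarin, English, Mandarin.
encodings = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
lang_ids = [ZH, EN, ZH]
lang_table = [[1.0, 0.0], [0.0, 1.0]]   # rows: ZH, EN
phon_table = [[2.0, 2.0], [4.0, 4.0]]   # rows: two hypothetical phonology styles
zero_w = [0.0, 0.0]                     # yields strength sigmoid(0) = 0.5

out = condition_tokens(encodings, lang_ids, 1,
                       lang_table, phon_table, zero_w, zero_w)
```

In this toy setup the Mandarin tokens receive only the scaled language embedding, while the English token also receives the scaled phonology embedding, so switching `phon_id` changes the English token alone.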