Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional effort from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and the countless languages spoken worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary, i.e., prior information that already exists in natural language. Specifically, we design a semantics-to-pronunciation attention (S2PA) module to match the semantic patterns of the input text sequence against the prior semantics in the dictionary and retrieve the corresponding pronunciations; the S2PA module can be easily trained end-to-end with the TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses with different linguistic encoders demonstrate that each design choice in Dict-TTS is effective. Audio samples are available at \url{https://dicttts.github.io/DictTTS-Demo/}.
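To make the described mechanism concrete, the following is a minimal, hypothetical sketch (in PyTorch) of how an attention layer could match character hidden states against dictionary glosses and read out pronunciations without phoneme labels. The class name `S2PASketch`, the tensor shapes, and the single-projection scoring are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch (not the authors' code): semantics-to-pronunciation attention.
# Each character has a set of dictionary entries; every entry pairs a semantic-gloss
# embedding (key) with a pronunciation embedding (value). Attending from the text
# hidden state to the glosses yields soft weights over candidate pronunciations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class S2PASketch(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, char_hidden, gloss_keys, pron_values, entry_mask):
        """
        char_hidden: [B, T, H]     encoder states for each character
        gloss_keys:  [B, T, E, H]  semantic-gloss embeddings of E dictionary entries
        pron_values: [B, T, E, H]  pronunciation embeddings aligned with the glosses
        entry_mask:  [B, T, E]     True where a dictionary entry exists
        """
        q = self.query_proj(char_hidden).unsqueeze(2)                     # [B, T, 1, H]
        scores = (q * gloss_keys).sum(-1) / gloss_keys.size(-1) ** 0.5    # [B, T, E]
        scores = scores.masked_fill(~entry_mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)                                  # soft entry selection
        pron = (attn.unsqueeze(-1) * pron_values).sum(2)                  # [B, T, H]
        return pron, attn


# Usage (shapes are illustrative): feeding encoder states and per-character dictionary
# entries returns a pronunciation representation; argmax over `attn` picks an entry.
```

Because the attention weights are produced from the text encoder's own hidden states, such a module can be trained jointly with the TTS objective, which is the property the abstract highlights (no separate phoneme annotation stage).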