Although end-to-end text-to-speech (TTS) models can generate natural speech, estimating sentence-level phonetic and prosodic information from raw text remains a challenge in Japanese TTS systems. In this paper, we propose a method for polyphone disambiguation (PD) and accent prediction (AP). The proposed method combines explicit features extracted from morphological analysis with implicit features extracted from pre-trained language models (PLMs). We use BERT and Flair embeddings as implicit features and examine how to combine them with explicit features. Objective evaluation showed that the proposed method improved accuracy by 5.7 points in PD and 6.0 points in AP. Moreover, a perceptual listening test confirmed that a TTS system employing our proposed model as a front-end achieved a mean opinion score for naturalness close to that of speech synthesized with ground-truth pronunciation and accent.
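As a rough illustration of what combining explicit and implicit features can look like, the sketch below concatenates a one-hot part-of-speech vector with a per-token embedding. The abstract does not say which combination strategy the paper adopts, so concatenation is an assumption made here for illustration only. The tag set `POS_TAGS`, the helper names, and the four-dimensional vector standing in for a BERT or Flair embedding are all hypothetical.

```python
from typing import List

# Hypothetical, heavily reduced part-of-speech tag set (assumption for illustration).
POS_TAGS = ["noun", "verb", "particle"]


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def combine_features(pos_tag: str, plm_embedding: List[float]) -> List[float]:
    """Concatenate an explicit feature (one-hot POS tag) with an implicit
    feature (a token embedding that would come from a PLM).

    Concatenation is one possible combination, assumed here; it is not
    claimed to be the paper's method.
    """
    explicit = one_hot(POS_TAGS.index(pos_tag), len(POS_TAGS))
    return explicit + list(plm_embedding)


# Toy 4-dimensional vector standing in for a real BERT or Flair embedding.
token_features = combine_features("verb", [0.12, -0.40, 0.88, 0.05])
print(len(token_features))
```

In a real front-end, the combined per-token vectors would be passed to a sequence model that predicts pronunciation and accent labels; that part is omitted here.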