This paper proposes a multilingual speech synthesis method that combines unsupervised phonetic representations (UPRs) and supervised phonetic representations (SPRs) to avoid reliance on pronunciation dictionaries for the target languages. In this method, a pretrained wav2vec 2.0 model is adopted to extract UPRs, and a language-independent automatic speech recognition (LI-ASR) model is trained with a connectionist temporal classification (CTC) loss to extract segment-level SPRs from the audio data of the target languages. An acoustic model is then designed that first predicts UPRs and SPRs separately from text and then combines the predicted UPRs and SPRs to generate mel-spectrograms. Experimental results on six languages show that the proposed method outperforms both the methods that predict mel-spectrograms directly from character or phoneme sequences and the ablated models that use only UPRs or only SPRs.
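The two-branch structure of the acoustic model described above can be illustrated with a minimal sketch: one branch predicts UPRs, another predicts SPRs, and a decoder fuses the two to produce a mel-spectrogram. All dimensions, the random linear maps, and the concatenation-plus-projection fusion here are illustrative assumptions for shape-checking only, not the paper's actual network design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: frames, UPR size, SPR size, mel bins
T, D_UPR, D_SPR, N_MELS = 50, 256, 128, 80

def predict_upr(text_hidden):
    # Stand-in for the UPR prediction branch (in the paper, trained
    # against features from a pretrained wav2vec 2.0 model).
    W = rng.standard_normal((text_hidden.shape[1], D_UPR)) * 0.01
    return text_hidden @ W

def predict_spr(text_hidden):
    # Stand-in for the SPR prediction branch (in the paper, targets come
    # from a CTC-trained language-independent ASR model).
    W = rng.standard_normal((text_hidden.shape[1], D_SPR)) * 0.01
    return text_hidden @ W

def decode_mel(upr, spr):
    # Fuse the two phonetic representations, then project to mel bins.
    fused = np.concatenate([upr, spr], axis=1)           # (T, D_UPR + D_SPR)
    W = rng.standard_normal((fused.shape[1], N_MELS)) * 0.01
    return fused @ W                                      # (T, N_MELS)

# One utterance's text-encoder output (assumed hidden size 192).
text_hidden = rng.standard_normal((T, 192))
mel = decode_mel(predict_upr(text_hidden), predict_spr(text_hidden))
print(mel.shape)  # (50, 80)
```

The point of the sketch is the data flow: both representations are predicted from the same text encoding but supervised by different teachers, and only their fusion drives mel-spectrogram generation.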