We propose a novel phrase break prediction method that combines implicit features extracted from a pre-trained large language model (BERT) with explicit features extracted from a BiLSTM with linguistic features. In conventional BiLSTM-based methods, word representations and/or sentence representations are used as independent components. The proposed method takes both representations into account to extract latent semantics that previous methods cannot capture. Objective evaluation results show that the proposed method obtains an absolute improvement of 3.2 points in F1 score over conventional BiLSTM-based methods using linguistic features. Moreover, perceptual listening test results verify that a TTS system using our proposed method achieves a mean opinion score of 4.39 in prosody naturalness, which is highly competitive with the score of 4.37 for speech synthesized with ground-truth phrase breaks.
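As an illustration of how an absolute F1 difference in points is read, the following minimal sketch computes F1 for the break label over per-word label sequences. The label names (`B` for break, `N` for no break), the function name `break_f1`, and the toy sequences are assumptions made for this example; they are not the paper's data, label set, or evaluation script.

```python
def break_f1(reference, predicted, positive="B"):
    """F1 score for the break label over two equal-length label sequences."""
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same length")
    pairs = list(zip(reference, predicted))
    # Count true positives, false positives, and false negatives for the break label.
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy per-word labels (hypothetical): one label after each word.
reference = ["N", "N", "B", "N", "N", "B", "N", "B"]
system_a = ["N", "B", "B", "N", "N", "N", "N", "B"]  # one spurious break, one missed break
system_b = ["N", "N", "B", "N", "N", "B", "N", "N"]  # one missed break

f1_a = break_f1(reference, system_a)
f1_b = break_f1(reference, system_b)

# An "absolute improvement in points" is the plain difference of the two
# scores on a 0-100 scale, not a relative (percentage) change.
absolute_gain_points = 100 * (f1_b - f1_a)
print(f"F1 A: {100 * f1_a:.1f}, F1 B: {100 * f1_b:.1f}, gain: {absolute_gain_points:.1f} points")
```

The numbers produced by this toy data are unrelated to the 3.2-point improvement reported above; the sketch only shows how such a difference is computed.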