Expressive speech synthesis, such as audiobook synthesis, remains challenging because of the difficulty of style representation learning and prediction. Deriving styles from reference audio or predicting style tags from text both require a large amount of labeled data, which is costly to acquire and whose labels are difficult to define and annotate accurately. In this paper, we propose a novel framework that learns style representations from abundant plain text in a self-supervised manner. It leverages an emotion lexicon and uses contrastive learning and deep clustering. We further integrate the style representation as a condition embedding into a multi-style Transformer TTS. Compared with a multi-style TTS that predicts style tags and is trained on the same dataset but with human annotations, our method achieves improved results in subjective evaluations on both in-domain and out-of-domain test sets of audiobook speech. Moreover, with the implicit context-aware style representation, the emotion transitions of synthesized audio over a long paragraph sound more natural. Audio samples are available on the demo webpage.
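As a rough illustration only, not the paper's implementation, the sketch below shows one plausible way contrastive learning with emotion-lexicon-derived positives could shape a style-embedding space from plain text. All names here (the lexicon feature vectors, thresholds, and embedding sizes) are hypothetical placeholders, and the deep-clustering objective and TTS conditioning are omitted.

```python
# Minimal sketch, assuming sentence style embeddings and lexicon-based emotion
# features are already computed; random tensors stand in for both below.
import torch
import torch.nn.functional as F

def lexicon_positive_pairs(emo_vecs, threshold=0.8):
    """Mark sentence pairs as positives when their emotion-lexicon
    feature vectors are highly similar (cosine similarity above threshold)."""
    sim = F.cosine_similarity(emo_vecs.unsqueeze(1), emo_vecs.unsqueeze(0), dim=-1)
    pos_mask = (sim > threshold).float()
    pos_mask.fill_diagonal_(0)  # a sentence is not its own positive
    return pos_mask

def contrastive_style_loss(style_emb, pos_mask, temperature=0.1):
    """InfoNCE-style loss: pull lexicon-similar sentences together in the
    style-embedding space and push dissimilar ones apart."""
    z = F.normalize(style_emb, dim=-1)
    logits = z @ z.t() / temperature
    logits.fill_diagonal_(float('-inf'))  # exclude self-similarity
    log_prob = F.log_softmax(logits, dim=-1)
    pos_count = pos_mask.sum(dim=-1).clamp(min=1)
    loss = -(pos_mask * log_prob).sum(dim=-1) / pos_count
    return loss.mean()

# Toy usage with random stand-ins for encoder outputs and lexicon features.
style_emb = torch.randn(8, 128)  # style embeddings for 8 sentences
emo_vecs = torch.randn(8, 6)     # lexicon-based emotion features (e.g., 6 emotions)
loss = contrastive_style_loss(style_emb, lexicon_positive_pairs(emo_vecs))
```

In the paper's framework, such a contrastive objective would be combined with deep clustering, and the resulting embedding would condition the multi-style Transformer TTS; the snippet only conveys the self-supervised, lexicon-guided idea.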