We propose an end-to-end empathetic dialogue speech synthesis (DSS) model that considers both the linguistic and prosodic contexts of the dialogue history. Empathy is the active attempt by humans to get inside the interlocutor in dialogue, and empathetic DSS is a technology for implementing this act in spoken dialogue systems. Our model is conditioned on the history of linguistic and prosodic features in order to predict an appropriate dialogue context. As such, it can be regarded as an extension of conventional dialogue history modeling based on linguistic features alone. To train the empathetic DSS model effectively, we investigate 1) a self-supervised learning model pretrained on large speech corpora, 2) style-guided training, in which the dialogue context embedding is trained to predict a prosody embedding of the current utterance, 3) cross-modal attention to combine the text and speech modalities, and 4) sentence-wise embedding to achieve finer-grained prosody modeling than utterance-wise modeling. The evaluation results demonstrate that 1) simply considering the prosodic contexts of the dialogue history does not improve speech quality in empathetic DSS and 2) introducing style-guided training and sentence-wise embedding modeling achieves higher speech quality than the conventional method.
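The cross-modal attention named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the attention formulation, so the following assumes plain scaled dot-product attention in which text embeddings of the dialogue history act as queries and speech embeddings act as keys and values, with no learned projections. The function names and the toy embedding values are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text_history, speech_history):
    """Fuse text and speech embeddings of the dialogue history.

    Each text embedding is used as a query over the speech embeddings
    (keys and values). This is an assumed, simplified formulation:
    a real model would add learned query/key/value projections.
    """
    dim = len(speech_history[0])
    scale = math.sqrt(dim)
    fused = []
    for query in text_history:
        # Similarity between this text embedding and every speech embedding.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in speech_history
        ]
        weights = softmax(scores)
        # Weighted average of the speech embeddings.
        context = [
            sum(w * value[i] for w, value in zip(weights, speech_history))
            for i in range(dim)
        ]
        fused.append(context)
    return fused


# Toy history of three past utterances with 2-dimensional embeddings
# (made-up values, for illustration only).
text_history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
speech_history = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.7]]
fused = cross_modal_attention(text_history, speech_history)
```

Each fused vector is a convex combination of the speech embeddings, weighted by how strongly the corresponding text embedding matches each one. In a full system, such fused vectors would be summarized into the dialogue context embedding that conditions the synthesizer.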