We present a comprehensive empirical study of personalized spontaneous speech synthesis based on linguistic knowledge. With the advent of voice cloning for reading-style speech synthesis, a new voice cloning paradigm for human-like, spontaneous speech synthesis is required. We therefore focus on personalized spontaneous speech synthesis that can clone both an individual's voice timbre and speech disfluency. Specifically, we deal with filled pauses, a major source of speech disfluency, which are known in psychology and linguistics to play an important role in speech generation and communication. To comparatively evaluate personalized filled pause insertion and non-personalized filled pause prediction methods, we developed a speech synthesis method with a non-personalized external filled pause predictor trained on a multi-speaker corpus. The results clarify the position-word entanglement of filled pauses, i.e., the necessity of precisely predicting positions for naturalness and the necessity of precisely predicting words for individuality in the evaluation of synthesized speech.