In this paper, we study the controllability of an expressive TTS system trained on a dataset suited to continuous control. The dataset is the Blizzard 2013 dataset, which is based on audiobooks read by a female speaker and contains great variability in style and expressiveness. Controllability is evaluated with both an objective and a subjective experiment. The objective assessment is based on a measure of correlation between acoustic features and the dimensions of the latent space representing expressiveness. The subjective assessment is based on a perceptual experiment in which participants are shown an interface for controllable expressive TTS and asked to retrieve a synthetic utterance whose expressiveness subjectively corresponds to that of a reference utterance.
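The objective assessment described above can be illustrated with a minimal sketch. The abstract does not say which correlation measure or which acoustic features are used, so the Pearson coefficient, the feature name `mean_f0_hz`, the latent names `z1` and `z2`, and all numeric values below are assumptions made purely for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes neither sequence is constant (a zero variance would divide by zero).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_table(latents, features):
    """Correlate every latent dimension with every acoustic feature.

    latents:  dict mapping a latent-dimension name to per-utterance values
    features: dict mapping an acoustic-feature name to per-utterance values
    """
    return {
        (dim, feat): pearson(z, f)
        for dim, z in latents.items()
        for feat, f in features.items()
    }

# Toy values, invented for illustration only (not taken from the paper).
latents = {
    "z1": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "z2": [0.3, -0.8, 0.1, 0.9, -0.5],
}
features = {"mean_f0_hz": [180.0, 195.0, 210.0, 225.0, 240.0]}

table = correlation_table(latents, features)
for (dim, feat), r in table.items():
    print(f"{dim} vs {feat}: r = {r:+.3f}")
```

In this toy setup, a latent dimension whose correlation with a feature is close to 1 in absolute value would be a candidate control for that acoustic property, while a coefficient near 0 suggests the dimension does not linearly drive it.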