Expressive speech data are scarce across languages, and recording sessions are costly and time-consuming. To overcome these issues, we demonstrate how to build low-resource neural text-to-speech (TTS) voices with only 1 hour of conversational speech, when no other conversational data are available in the same language. Assuming that non-expressive speech data are available in that language, we propose a 3-step method: 1) we train an F0-conditioned voice conversion (VC) model as a data augmentation technique; 2) we train an F0 predictor to control the conversational flavour of the voice-converted synthetic data; 3) we train a TTS system that consumes the augmented data. We show that our method enables F0 controllability, scales across speakers and languages, and is competitive in naturalness with a state-of-the-art baseline, another augmentation-based method that does not use F0 information.
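To make the idea of conditioning on F0 more concrete, the sketch below shows a simple hand-written stand-in for the learned F0 predictor of step 2. It is not the method of this work. It assumes that a frame-level F0 contour in Hz is available, with 0 marking unvoiced frames, and it moves the contour's log-F0 mean and standard deviation to hypothetical target values that might be estimated from the 1 hour of conversational speech. The function name `transfer_log_f0_stats` and all numbers are illustrative assumptions.

```python
import math
from statistics import mean, pstdev


def transfer_log_f0_stats(f0_hz, target_mean, target_std):
    """Shift and scale voiced log-F0 to a target mean and standard deviation.

    Illustrative assumption only: a statistical stand-in for a learned
    F0 predictor. Unvoiced frames (F0 <= 0) are returned as 0.0.
    """
    voiced = [math.log(f) for f in f0_hz if f > 0]
    if not voiced:
        return [0.0 for _ in f0_hz]
    src_mean = mean(voiced)
    # Fall back to 1.0 so a perfectly flat contour does not divide by zero.
    src_std = pstdev(voiced) or 1.0
    out = []
    for f in f0_hz:
        if f > 0:
            z = (math.log(f) - src_mean) / src_std
            out.append(math.exp(z * target_std + target_mean))
        else:
            out.append(0.0)
    return out


# Hypothetical neutral contour (Hz) and hypothetical conversational targets.
neutral_f0 = [0.0, 110.0, 120.0, 115.0, 0.0, 130.0]
conversational_f0 = transfer_log_f0_stats(
    neutral_f0, target_mean=math.log(150.0), target_std=0.25
)
```

In a pipeline like the one described above, a contour of this kind would be what the F0-conditioned VC model receives as its conditioning input; a learned predictor would replace the fixed statistics used here.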