This study aims to design an environment-aware text-to-speech (TTS) system that can generate speech suited to a specific acoustic environment. It is also motivated by the desire to leverage massive amounts of speech audio from heterogeneous sources in TTS system development. The key idea is to model the acoustic environment of speech audio as a factor of data variability and incorporate it as a condition in the neural network based speech synthesis process. Two embedding extractors are trained on two purposely constructed datasets to characterize and disentangle the speaker and environment factors in speech. A neural network model is then trained to generate speech from the extracted speaker and environment embeddings. Objective and subjective evaluation results demonstrate that the proposed TTS system effectively disentangles speaker and environment factors and synthesizes speech audio that carries the designated speaker characteristics and environment attributes. Audio samples are available online for demonstration at https://daxintan-cuhk.github.io/Environment-Aware-TTS/ .
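The conditioning scheme described above can be illustrated with a minimal sketch: the synthesizer receives a speaker embedding and an environment embedding, combined into a single condition vector, so that any speaker can be paired with any environment at synthesis time. All names and dimensions here are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of conditioning a synthesizer on two disentangled
# embeddings. SPEAKER_DIM and ENV_DIM are assumed sizes, not from the paper.
import numpy as np

SPEAKER_DIM = 256   # assumed speaker-embedding size
ENV_DIM = 128       # assumed environment-embedding size

def build_condition(speaker_emb: np.ndarray, env_emb: np.ndarray) -> np.ndarray:
    """Concatenate speaker and environment embeddings into one condition vector."""
    assert speaker_emb.shape == (SPEAKER_DIM,)
    assert env_emb.shape == (ENV_DIM,)
    return np.concatenate([speaker_emb, env_emb])

# Factor mixing: pair an arbitrary speaker with an arbitrary environment.
speaker = np.random.randn(SPEAKER_DIM)
env = np.random.randn(ENV_DIM)
cond = build_condition(speaker, env)
print(cond.shape)  # (384,)
```

In a full system, `cond` would be broadcast across decoder time steps of the acoustic model; the concatenation shown here is only one plausible way to fuse the two factors.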