Most text-to-speech (TTS) methods use high-quality speech corpora recorded in a well-designed environment, which makes data collection costly. To solve this problem, existing noise-robust TTS methods are designed to use noisy speech corpora as training data. However, they address either time-invariant or time-variant noise, but not both. We propose a degradation-robust TTS method that can be trained on speech corpora containing both additive noise and environmental distortions. It jointly represents the time-variant additive noise with a frame-level encoder and the time-invariant environmental distortions with an utterance-level encoder. We also propose a regularization method to obtain a clean environmental embedding that is disentangled from utterance-dependent information such as linguistic content and speaker characteristics. Evaluation results show that our method achieved significantly higher-quality synthetic speech than previous methods under conditions that include both additive noise and reverberation.
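The split between a frame-level and an utterance-level representation can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the dimensions, the random weights, and the use of a single tanh projection with mean pooling are hypothetical stand-ins, since the abstract does not specify the encoder architectures. The sketch only shows the shape of the idea: one embedding per frame for time-variant noise, and one embedding per utterance, repeated across frames, for time-invariant distortion.

```python
import math
import random


def project(frame, weight):
    """Apply tanh(frame @ weight) to one frame; weight has one row per input dim."""
    out_dim = len(weight[0])
    return [
        math.tanh(sum(x * row[j] for x, row in zip(frame, weight)))
        for j in range(out_dim)
    ]


def frame_level_encoder(frames, weight):
    """Hypothetical stand-in: one embedding per frame (time-variant)."""
    return [project(f, weight) for f in frames]


def utterance_level_encoder(frames, weight):
    """Hypothetical stand-in: project every frame, then average over time,
    giving a single embedding per utterance (time-invariant)."""
    projected = [project(f, weight) for f in frames]
    n = len(projected)
    return [sum(col) / n for col in zip(*projected)]


def build_conditioning(frames, w_noise, w_env):
    """Concatenate the per-frame noise embedding with the utterance-level
    environment embedding, which is repeated for every frame."""
    noise_emb = frame_level_encoder(frames, w_noise)
    env_emb = utterance_level_encoder(frames, w_env)
    return [n + env_emb for n in noise_emb]


# Illustrative sizes only: 120 frames of 80-dim features, 16-dim noise
# embedding, 8-dim environment embedding.
rng = random.Random(0)
frames = [[rng.gauss(0, 1) for _ in range(80)] for _ in range(120)]
w_noise = [[rng.gauss(0, 0.1) for _ in range(16)] for _ in range(80)]
w_env = [[rng.gauss(0, 0.1) for _ in range(8)] for _ in range(80)]

cond = build_conditioning(frames, w_noise, w_env)
print(len(cond), len(cond[0]))  # 120 24
```

In this sketch the last 8 values of every row are identical, because they come from the single utterance-level vector, while the first 16 values change from frame to frame. The regularization that disentangles the environmental embedding from linguistic content and speaker characteristics is not modeled here.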