This paper proposes a method for selecting text-to-speech (TTS) training data from dark data. TTS models are typically trained on high-quality speech corpora whose collection costs considerable time and money, which makes it very challenging to increase speaker variation. In contrast, there is a large amount of data whose usability is unknown (so-called "dark data"), such as YouTube videos. To utilize data other than TTS corpora, previous studies have selected speech data from such sources on the basis of acoustic quality. However, given that TTS models robust to data noise have been proposed, data should be selected on the basis of its importance as training data for the given TTS model, not the quality of the speech itself. Our method selects training data through a loop of training and evaluation, using the automatically predicted quality of the synthetic speech produced by the given TTS model. Evaluations using YouTube data show that our method outperforms the conventional acoustic-quality-based method.
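To make the training-evaluation loop concrete, here is a minimal sketch of how such evaluation-in-the-loop selection could look, assuming hypothetical helpers train_tts, synthesize, and predict_quality (the last standing in for an automatic MOS-style quality predictor); the paper's actual models, selection unit, and acceptance criterion may differ.

```python
# A minimal sketch of evaluation-in-the-loop data selection, assuming
# hypothetical helpers train_tts(data), synthesize(model, text), and
# predict_quality(waveform). These names are placeholders for
# illustration, not the paper's actual implementation.

def select_training_data(candidate_subsets, eval_texts, threshold=3.5):
    """Greedily keep candidate subsets (e.g., per-speaker utterance sets
    mined from dark data) whose inclusion yields synthetic speech with
    high predicted quality."""
    selected = []
    for subset in candidate_subsets:
        # Train (or fine-tune) the given TTS model with the subset added.
        model = train_tts(selected + [subset])
        # Evaluate: synthesize held-out texts and predict their quality.
        scores = [predict_quality(synthesize(model, text)) for text in eval_texts]
        mean_score = sum(scores) / len(scores)
        # Keep the subset only if predicted quality remains high enough,
        # i.e., the data is useful as training data regardless of its
        # raw acoustic quality.
        if mean_score >= threshold:
            selected.append(subset)
    return selected
```

One design choice worth noting under these assumptions: the loop scores data by its effect on the TTS model's output rather than by signal-level quality, which is exactly the shift the abstract argues for.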