We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion applied to data augmentation for automatic speech recognition (ASR) systems. Through extensive experiments, we show that our approach enables speech synthesis and voice conversion to improve ASR systems for a target language while using only one target-language speaker during model training. Relative to other works that rely on many speakers, we also narrow the gap between ASR models trained on synthesized versus human speech. Finally, we show that promising ASR training results can be obtained with our data augmentation method using only a single real speaker in the target language.
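As a rough illustration of the data-augmentation idea summarized above (not the authors' actual pipeline), the sketch below mixes a small single-speaker corpus with utterances produced by a synthesis front end before ASR training. The `synthesize` callback and all names are placeholders standing in for a cross-lingual TTS or voice-conversion model.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Utterance:
    audio_path: str   # path to a waveform file
    text: str         # transcript used as the ASR training target

def augment_with_synthetic(
    real: List[Utterance],
    extra_texts: List[str],
    synthesize: Callable[[str], str],   # placeholder TTS/VC front end: text -> audio path
    synth_ratio: float = 1.0,
    seed: int = 0,
) -> List[Utterance]:
    """Mix a small single-speaker corpus with synthesized utterances.

    `synth_ratio` controls how much synthetic speech is added relative
    to the amount of real data; the combined, shuffled list is then fed
    to whatever ASR training loop is in use.
    """
    rng = random.Random(seed)
    n_synth = int(len(real) * synth_ratio)
    picked = rng.sample(extra_texts, min(n_synth, len(extra_texts)))
    synthetic = [Utterance(audio_path=synthesize(t), text=t) for t in picked]
    mixed = real + synthetic
    rng.shuffle(mixed)
    return mixed
```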