This work focuses on modelling a speaker's accent when no dedicated text-to-speech (TTS) frontend, including a grapheme-to-phoneme (G2P) module, exists for that accent. Prior work on accent modelling assumes a phonetic transcription is available for the target accent, which may not be the case for low-resource, regional accents. We propose an approach whereby we first augment the target-accent data to sound like the donor voice via voice conversion, then train a multi-speaker, multi-accent TTS model on the combination of recordings and synthetic data, to generate the donor's voice speaking in the target accent. Throughout this procedure, we use a TTS frontend developed for the same language but a different accent. We present qualitative and quantitative analyses showing that the proposed strategy achieves state-of-the-art results compared to other generative models. Our work demonstrates that low-resource accents can be modelled with relatively little data and without developing an accent-specific TTS frontend. Audio samples of our model converting to multiple accents are available on our web page.