In this paper, we present AISHELL-3, a large-scale, high-fidelity multi-speaker Mandarin speech corpus for training multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers, whose auxiliary attributes such as gender, age group, and native accent are explicitly marked and provided with the corpus. Transcripts at both the Chinese character level and the pinyin level accompany the recordings. We also present a baseline system that uses AISHELL-3 for multi-speaker Mandarin speech synthesis. The system extends Tacotron-2 by incorporating a speaker verification model and a corresponding voice-similarity loss as a feedback constraint. Our goal is to use the presented corpus to build a robust synthesis model capable of zero-shot voice cloning; the system trained on this dataset also generalizes well to speakers never seen during training. Objective evaluation results from our experiments show that the proposed multi-speaker synthesis system achieves high voice similarity in terms of both speaker embedding similarity and equal error rate. The dataset, baseline system code, and generated samples are available online.
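To make the feedback-constraint idea concrete, the sketch below shows one common form such a loss can take: a cosine-distance penalty between speaker embeddings of the synthesized and reference mel-spectrograms, added to the usual Tacotron-2 reconstruction objective. This is a minimal illustration under our own assumptions, not the paper's implementation; the function name `feedback_constraint_loss`, the pretrained `speaker_encoder` module, and the weight `alpha` are hypothetical.

```python
import torch
import torch.nn.functional as F

def feedback_constraint_loss(mel_generated, mel_reference, speaker_encoder, alpha=1.0):
    """Hypothetical voice-similarity feedback loss.

    `speaker_encoder` is assumed to be a pretrained speaker-verification
    model mapping a batch of mel-spectrograms to fixed-size embeddings.
    The synthesizer is penalized when the embedding of its output drifts
    away from the embedding of the reference speaker's audio.
    """
    emb_gen = F.normalize(speaker_encoder(mel_generated), dim=-1)
    emb_ref = F.normalize(speaker_encoder(mel_reference), dim=-1)
    # 1 - cosine similarity: zero when the two embeddings coincide.
    sim_loss = (1.0 - (emb_gen * emb_ref).sum(dim=-1)).mean()
    return alpha * sim_loss
```

In training, this term would be added to the standard spectrogram reconstruction and stop-token losses, with the speaker encoder's weights typically frozen so that it acts purely as a similarity critic.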
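As a concrete reading of the equal-error-rate measurement mentioned above, the following is a minimal sketch of how EER is conventionally computed from pairwise speaker-similarity scores. It assumes scikit-learn's `roc_curve`; it is not code from the released baseline.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels):
    """Compute the EER from similarity scores and binary same-speaker labels.

    `scores` holds a similarity score per trial pair; `labels` is 1 for
    same-speaker pairs and 0 otherwise. The EER is the operating point
    where the false-acceptance rate equals the false-rejection rate.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where the two rates cross
    return (fpr[idx] + fnr[idx]) / 2.0
```

A lower EER between synthesized and genuine utterances of the same speaker indicates higher voice similarity under the verification model.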