This paper describes the AS-NU systems for the two tracks of the Multi-Speaker Multi-Style Voice Cloning Challenge (M2VoC). The first track provides 100 target utterances for voice cloning, while the second track provides only 5. Because data are so scarce in the second track, we selected the speaker most similar to the target speaker from the training data of the TTS system, and fine-tuned our model on that speaker's utterances together with the 5 given target utterances. The evaluation results show that our systems for the two tracks perform similarly in terms of quality, but the similarity score for the second track still lags clearly behind that for the first track.
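The selection of the most similar training speaker for the second track can be illustrated with a small sketch. The abstract does not state how speaker similarity was measured, so the following is an assumption rather than the authors' method: each utterance is represented by a fixed-length speaker embedding, the 5 target embeddings are averaged, and the training speaker whose embedding has the highest cosine similarity to that average is chosen. The function names, speaker IDs, and toy 3-dimensional vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_embedding(embeddings):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def most_similar_speaker(target_utt_embeddings, train_speaker_embeddings):
    """Return the ID of the training speaker closest to the target speaker.

    Assumption: similarity is cosine similarity between the averaged target
    embedding and one embedding per training speaker.
    """
    target = mean_embedding(target_utt_embeddings)
    return max(
        train_speaker_embeddings,
        key=lambda spk: cosine_similarity(target, train_speaker_embeddings[spk]),
    )


# Hypothetical toy data: 5 target utterance embeddings and 3 training speakers.
target_utts = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.1],
    [0.9, 0.1, -0.1],
]
train_speakers = {
    "spk_a": [1.0, 0.0, 0.0],
    "spk_b": [0.0, 1.0, 0.0],
    "spk_c": [0.7, 0.7, 0.0],
}

selected = most_similar_speaker(target_utts, train_speakers)
print(selected)
```

In a setup like this, the utterances of the selected speaker would then be pooled with the 5 target utterances for fine-tuning, as the abstract describes; how the embeddings themselves are obtained is left open here.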