Voice cloning is a difficult task that requires robust, informative features incorporated into a high-quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which has been shown to produce high-quality speech representations when specific audio augmentations are applied to the vanilla algorithm. We further extend the augmentations in the training procedure to help the resulting features capture the speaker identity and to make them robust to noise and acoustic conditions. The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture, aiming to achieve multispeaker speech synthesis without utilizing additional speaker features. This method enables us to train our model on an unlabeled multispeaker dataset and to use unseen speaker embeddings to copy a speaker's voice. Subjective and objective evaluations validate the proposed model, as well as its robustness to the acoustic conditions of the target utterance.
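To make the BYOL objective concrete, the following is a minimal PyTorch sketch of the self-supervised training step: two augmented views of the same utterance are encoded, and the online network is trained to predict the target network's projection, while the target tracks the online weights via an exponential moving average. All names (`Encoder`, `augment`, layer sizes) are illustrative placeholders, not the paper's architecture, and the stand-in augmentation is simple additive noise rather than the extended augmentation set described above.

```python
# Minimal BYOL sketch for audio embeddings (illustrative, not the paper's model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy encoder mapping a log-mel spectrogram to an utterance-level embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )
    def forward(self, x):  # x: (batch, 1, n_mels, frames)
        return self.net(x)

def byol_loss(online_pred, target_proj):
    """Negative cosine similarity between L2-normalised vectors."""
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj, dim=-1)
    return 2 - 2 * (p * z).sum(dim=-1).mean()

def augment(spec):
    """Placeholder for the audio augmentations (e.g. mixup, random resize crop,
    added noise); plain Gaussian noise is used here only as a stand-in."""
    return spec + 0.1 * torch.randn_like(spec)

dim = 256
online_enc, target_enc = Encoder(dim), Encoder(dim)
online_proj, target_proj_net = nn.Linear(dim, dim), nn.Linear(dim, dim)
predictor = nn.Linear(dim, dim)  # only the online branch has a predictor
target_enc.load_state_dict(online_enc.state_dict())
target_proj_net.load_state_dict(online_proj.state_dict())
for p in list(target_enc.parameters()) + list(target_proj_net.parameters()):
    p.requires_grad = False  # target network is updated by EMA, not gradients

spec = torch.randn(8, 1, 64, 200)          # batch of log-mel spectrograms
v1, v2 = augment(spec), augment(spec)      # two augmented views of each utterance
loss = byol_loss(predictor(online_proj(online_enc(v1))),
                 target_proj_net(target_enc(v2)).detach())
loss = loss + byol_loss(predictor(online_proj(online_enc(v2))),
                        target_proj_net(target_enc(v1)).detach())  # symmetrised
loss.backward()

# Target network tracks the online network via an exponential moving average.
tau = 0.99
with torch.no_grad():
    for po, pt in zip(online_enc.parameters(), target_enc.parameters()):
        pt.mul_(tau).add_((1 - tau) * po)
    for po, pt in zip(online_proj.parameters(), target_proj_net.parameters()):
        pt.mul_(tau).add_((1 - tau) * po)
```

After training, the target and predictor heads are discarded and the encoder alone produces the utterance-level embeddings that condition the TTS model.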