Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis

Building multispeaker neural network-based text-to-speech (TTS) synthesis systems commonly relies on the availability of large amounts of high-quality recordings from each speaker, and on conditioning the training process on the speaker's identity or on a learned representation of it. However, when little data is available from each speaker, or the number of speakers is limited, a multispeaker TTS system can be hard to train and will yield poor speaker similarity and naturalness. To address this issue, we explore two directions: forcing the network to learn a better speaker identity representation by adding an extra loss term, and augmenting the input data for each speaker using waveform manipulation methods. We show that both methods are effective when evaluated with both objective and subjective measures. The additional loss term improves speaker similarity, while the data augmentation improves the intelligibility of the multispeaker TTS system.
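The two directions described above can be illustrated with a minimal sketch. The abstract does not specify the form of the extra loss term or which waveform manipulations are used, so everything here is an assumption made for illustration: the speaker term is taken to be a cosine distance between two speaker embeddings, the augmentation is taken to be speed perturbation by linear interpolation, and the names `combined_loss`, `speed_perturb`, and `weight` are hypothetical.

```python
import math
from typing import List, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means the embeddings point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def combined_loss(pred_feats: Sequence[float], target_feats: Sequence[float],
                  pred_spk_emb: Sequence[float], target_spk_emb: Sequence[float],
                  weight: float = 0.1) -> float:
    """Acoustic loss plus a weighted speaker term (assumed form, not the paper's)."""
    acoustic = mse(pred_feats, target_feats)
    speaker = cosine_distance(pred_spk_emb, target_spk_emb)
    return acoustic + weight * speaker


def speed_perturb(waveform: Sequence[float], factor: float) -> List[float]:
    """Resample by linear interpolation so the audio plays `factor` times faster."""
    if factor <= 0 or len(waveform) == 0:
        raise ValueError("factor must be positive and waveform non-empty")
    n_out = max(1, int(round(len(waveform) / factor)))
    out: List[float] = []
    for i in range(n_out):
        pos = i * factor                 # fractional read position in the input
        lo = int(pos)
        if lo >= len(waveform) - 1:      # past the last interval: hold final sample
            out.append(float(waveform[-1]))
            continue
        frac = pos - lo
        out.append((1.0 - frac) * waveform[lo] + frac * waveform[lo + 1])
    return out


# Usage: one training-style loss value and two augmented copies of a short signal.
loss = combined_loss([0.0], [1.0], [1.0, 0.0], [0.0, 1.0], weight=0.5)
faster = speed_perturb([0.0, 1.0, 2.0, 3.0], 2.0)
slower = speed_perturb([0.0, 2.0], 0.5)
```

In a real system the speaker embeddings would come from a speaker verification model and the loss would be computed on tensors inside a training framework; the pure-Python version only shows how the two terms combine and how an augmented copy of a waveform can be derived.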

Related content

Speech synthesis

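Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis; the quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information broadcast systems in banks and hospitals, car navigation systems, and automated-answering call centers, yielding substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, applications of speech synthesis are extending into entertainment, language teaching, rehabilitation therapy, and other fields. Speech synthesis can be said to be influencing every aspect of people's lives.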
