UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis

Text-to-speech (TTS) and singing voice synthesis (SVS) aim to generate high-quality speaking and singing voices from textual input and music scores, respectively. Unifying TTS and SVS in a single system is crucial for applications that require both. Existing methods are limited in that they rely either on both singing and speaking data from the same person or on cascaded models for multiple tasks. To address these problems, this paper proposes UniSyn, a simple and elegant framework for TTS and SVS. It is an end-to-end unified model that can make a voice speak and sing given only singing or only speaking data from that person. Specifically, UniSyn introduces a multi-conditional variational autoencoder (MC-VAE) that constructs two independent latent sub-spaces, conditioned on speaker and on style (i.e., speaking or singing), for flexible control. Moreover, a supervised guided-VAE and timbre perturbation with a Wasserstein distance constraint are used to further disentangle speaker timbre from style. Experiments on two speakers and two singers demonstrate that UniSyn can generate natural speaking and singing voices without the corresponding training data. The proposed approach outperforms the state-of-the-art end-to-end voice generation work, demonstrating the effectiveness and advantages of UniSyn.
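To make the Wasserstein distance constraint more concrete, here is a minimal sketch of the quantity such a constraint could penalize. It rests on assumptions the abstract does not confirm: that the latent posteriors are diagonal Gaussians, and that the constraint compares the posterior of an utterance with that of its timbre-perturbed copy. The function name `w2_squared_diag_gaussians` and the example values are hypothetical. The closed form is the standard squared 2-Wasserstein distance between two diagonal Gaussians: the squared difference of the means plus the squared difference of the standard deviations.

```python
from typing import Sequence


def w2_squared_diag_gaussians(
    mu_a: Sequence[float],
    sigma_a: Sequence[float],
    mu_b: Sequence[float],
    sigma_b: Sequence[float],
) -> float:
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For N(mu_a, diag(sigma_a^2)) and N(mu_b, diag(sigma_b^2)) the closed
    form is ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2.
    Diagonal Gaussian posteriors are an assumption of this sketch,
    not something stated in the abstract.
    """
    if not (len(mu_a) == len(sigma_a) == len(mu_b) == len(sigma_b)):
        raise ValueError("all inputs must have the same length")
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma_a, sigma_b))
    return float(mean_term + std_term)


# Hypothetical posterior parameters (mean, std) for an utterance and for
# its timbre-perturbed copy; the values are illustrative only.
original = ([0.0, 0.0], [1.0, 1.0])
perturbed = ([3.0, 4.0], [1.0, 1.0])
penalty = w2_squared_diag_gaussians(*original, *perturbed)
```

In a training loop, a term like `penalty` would be added to the loss so that the two posteriors stay close; which latent sub-space it applies to, and how it is weighted, is not specified in the abstract.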


Related content

Speech synthesis


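Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on many disciplines, including artificial intelligence, psychology, acoustics, linguistics, digital signal processing, and computer science, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, in car navigation systems, and in automated-answer call centers, and it has yielded substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, applications of speech synthesis are gradually extending into entertainment, language teaching, rehabilitation therapy, and other fields. It is fair to say that speech synthesis is influencing every aspect of people's lives.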
