Text-to-speech (TTS) and singing voice synthesis (SVS) aim to generate high-quality speaking and singing voices from textual input and music scores, respectively. Unifying TTS and SVS into a single system is crucial for applications that require both. Existing methods usually suffer from limitations: they rely either on both singing and speaking data from the same person or on cascaded models for multiple tasks. To address these problems, this paper proposes UniSyn, a simplified and elegant framework for TTS and SVS. It is an end-to-end unified model that can make a voice speak and sing given only speaking or singing data from that person. Specifically, UniSyn introduces a multi-conditional variational autoencoder (MC-VAE), which constructs two independent latent sub-spaces under speaker- and style-related (i.e., speaking or singing) conditions for flexible control. Moreover, supervised guided-VAE and timbre perturbation with a Wasserstein distance constraint are leveraged to further disentangle speaker timbre and style. Experiments conducted on two speakers and two singers demonstrate that UniSyn can generate natural speaking and singing voices without corresponding training data. The proposed approach outperforms state-of-the-art end-to-end voice generation work, which proves the effectiveness and advantages of UniSyn.
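To make the MC-VAE idea concrete, the following is a minimal sketch of two conditional posterior encoders producing independent latent sub-spaces, one conditioned on speaker identity and one on style (speaking vs. singing). All module and variable names here are hypothetical illustrations, not the UniSyn implementation.

```python
# Minimal sketch: two conditional encoders build independent latent sub-spaces,
# one under a speaker condition and one under a style (speak/sing) condition.
import torch
import torch.nn as nn

class ConditionalEncoder(nn.Module):
    """Encodes features plus a condition embedding into a diagonal Gaussian posterior."""
    def __init__(self, feat_dim, cond_dim, latent_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),  # predicts mean and log-variance
        )

    def forward(self, x, cond):
        mu, logvar = self.net(torch.cat([x, cond], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return z, mu, logvar

def kl_to_standard_normal(mu, logvar):
    """KL divergence between N(mu, sigma^2) and N(0, I), summed over latent dims."""
    return 0.5 * torch.sum(torch.exp(logvar) + mu ** 2 - 1.0 - logvar, dim=-1)

# Two independent sub-spaces: speaker-conditioned and style-conditioned.
spk_encoder = ConditionalEncoder(feat_dim=80, cond_dim=64, latent_dim=16)
sty_encoder = ConditionalEncoder(feat_dim=80, cond_dim=8, latent_dim=16)

feats = torch.randn(4, 80)     # e.g. frame-level acoustic features (assumed dims)
spk_emb = torch.randn(4, 64)   # speaker embedding (condition 1)
sty_emb = torch.randn(4, 8)    # style embedding: speak vs. sing (condition 2)

z_spk, mu_s, lv_s = spk_encoder(feats, spk_emb)
z_sty, mu_t, lv_t = sty_encoder(feats, sty_emb)
z = torch.cat([z_spk, z_sty], dim=-1)  # concatenated latent passed to a decoder

kl_loss = (kl_to_standard_normal(mu_s, lv_s) + kl_to_standard_normal(mu_t, lv_t)).mean()
```

Keeping the two posteriors separate, each with its own condition, is what allows the style condition to be swapped at inference time (e.g., asking a TTS-only speaker's latent to pair with the singing style condition).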
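The Wasserstein distance constraint mentioned above can be illustrated with the closed-form 2-Wasserstein distance between diagonal Gaussians; applying it between the timbre posteriors of an utterance and its perturbed copy is an assumption about how such a constraint could be realized, not the paper's exact loss.

```python
# Hedged illustration: closed-form squared 2-Wasserstein distance between two
# diagonal Gaussian posteriors, used here as a distance constraint on timbre.
import torch

def gaussian_w2_squared(mu1, logvar1, mu2, logvar2):
    """W2^2 between diagonal Gaussians: ||mu1 - mu2||^2 + ||std1 - std2||^2."""
    std1, std2 = torch.exp(0.5 * logvar1), torch.exp(0.5 * logvar2)
    return torch.sum((mu1 - mu2) ** 2 + (std1 - std2) ** 2, dim=-1)

# Timbre posterior of an original utterance vs. its timbre-perturbed copy
# (random placeholders); pulling them together encourages the speaker
# sub-space to stay invariant to the perturbation, aiding disentanglement.
mu_a, lv_a = torch.randn(4, 16), torch.randn(4, 16)
mu_b, lv_b = torch.randn(4, 16), torch.randn(4, 16)
w2_loss = gaussian_w2_squared(mu_a, lv_a, mu_b, lv_b).mean()
```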