Various applications of voice synthesis have been developed independently, despite the fact that they all generate "voice" as their output. In addition, most voice synthesis models still require a large amount of audio data paired with annotated labels (e.g., text transcriptions and music scores) for training. To this end, we propose NANSY++, a unified framework for synthesizing and manipulating voice signals from analysis features. The backbone network of NANSY++ is trained in a self-supervised manner that does not require any annotations paired with audio. After training the backbone network, we efficiently tackle four voice applications - i.e., voice conversion, text-to-speech, singing voice synthesis, and voice designing - by partially modeling the analysis features required for each task. Extensive experiments show that the proposed framework offers competitive advantages such as controllability, data efficiency, and fast training convergence, while providing high-quality synthesis. Audio samples: tinyurl.com/8tnsy3uc.
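To make the "shared backbone with task-specific analysis features" idea concrete, the following is a minimal illustrative sketch, not the authors' architecture: a hypothetical backbone decomposes a waveform into per-frame pitch and content features plus a global timbre embedding, and downstream tasks are framed as recombining or partially re-modeling those features before resynthesis. All module names, dimensions, and feature choices here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Hypothetical analysis/synthesis backbone (illustrative, not NANSY++)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        # Waveform -> frame-level hidden features.
        self.encoder = nn.Conv1d(1, dim, kernel_size=1024, stride=256)
        # Heads producing disentangled "analysis features" (assumed factorization).
        self.pitch_head = nn.Conv1d(dim, 1, kernel_size=1)        # per-frame pitch-like feature
        self.content_head = nn.Conv1d(dim, dim, kernel_size=1)    # per-frame linguistic content
        self.timbre_head = nn.AdaptiveAvgPool1d(1)                # global speaker embedding
        # Analysis features -> waveform.
        self.decoder = nn.ConvTranspose1d(dim + 1, 1, kernel_size=1024, stride=256)

    def analyze(self, wav: torch.Tensor) -> dict:
        h = self.encoder(wav)                      # (B, dim, T)
        return {
            "pitch": self.pitch_head(h),           # (B, 1, T)
            "content": self.content_head(h),       # (B, dim, T)
            "timbre": self.timbre_head(h),         # (B, dim, 1)
        }

    def synthesize(self, feats: dict) -> torch.Tensor:
        # Broadcast the global timbre embedding over frames, then decode.
        content = feats["content"] + feats["timbre"]
        x = torch.cat([content, feats["pitch"]], dim=1)
        return self.decoder(x)                     # (B, 1, samples)

# Voice conversion as feature recombination: pitch and content from the
# source utterance, timbre from the target speaker.
backbone = Backbone()
src, tgt = torch.randn(1, 1, 16384), torch.randn(1, 1, 16384)
f_src, f_tgt = backbone.analyze(src), backbone.analyze(tgt)
converted = backbone.synthesize({**f_src, "timbre": f_tgt["timbre"]})
```

Under this framing, the other applications differ only in which features are predicted rather than extracted: for example, a text-to-speech front end would replace the extracted content and pitch features with ones generated from text, while the backbone's synthesis path stays fixed.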