We present a neural analysis and synthesis (NANSY) framework that can manipulate the voice, pitch, and speed of an arbitrary speech signal. Most previous works have focused on using an information bottleneck to disentangle analysis features for controllable synthesis, which usually results in poor reconstruction quality. We address this issue by proposing a novel training strategy based on information perturbation. The idea is to perturb information in the original input signal (e.g., formant, pitch, and frequency response), thereby letting the synthesis networks selectively take essential attributes to reconstruct the input signal. Because NANSY does not need any bottleneck structures, it enjoys both high reconstruction quality and controllability. Furthermore, NANSY does not require any labels associated with the speech data, such as text or speaker information, but rather uses a new set of analysis features, i.e., the wav2vec feature and a newly proposed pitch feature, Yingram, which allows for fully self-supervised training. Taking advantage of fully self-supervised training, NANSY can be easily extended to a multilingual setting by simply training it with a multilingual dataset. The experiments show that NANSY achieves significant performance improvements in several applications such as zero-shot voice conversion, pitch shifting, and time-scale modification.
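To make the information-perturbation idea concrete, the sketch below randomly reshapes a signal's frequency response with a chain of peaking EQ filters (standard RBJ audio-EQ-cookbook biquads). This is only an illustrative stand-in for one of the perturbations the abstract mentions; the band layout, gain range, and Q values are assumptions, not the paper's actual settings.

```python
import numpy as np
from scipy.signal import lfilter

def peaking_biquad(f0, gain_db, q, sr):
    # RBJ audio-EQ-cookbook peaking-filter coefficients
    A = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * f0 / sr
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return b / a[0], a / a[0]

def perturb_frequency_response(wav, sr, n_bands=8, max_gain_db=12.0, rng=None):
    """Randomly reshape the spectral envelope with a chain of peaking filters.

    Illustrative stand-in for a frequency-response perturbation; the band
    count, gain range, and Q here are assumptions, not the paper's values.
    """
    rng = np.random.default_rng(rng)
    # Log-spaced center frequencies kept safely below Nyquist
    freqs = np.geomspace(60, sr / 2 * 0.9, n_bands)
    out = wav.astype(np.float64)
    for f0 in freqs:
        gain = rng.uniform(-max_gain_db, max_gain_db)
        b, a = peaking_biquad(f0, gain, q=1.0, sr=sr)
        out = lfilter(b, a, out)
    return out

# During training, the synthesis network would receive the perturbed signal
# but be asked to reconstruct the clean one, so it must learn to ignore the
# perturbed attribute rather than squeeze it through a bottleneck.
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 220 * t)  # 1 s of a 220 Hz tone
perturbed = perturb_frequency_response(clean, sr, rng=0)
```

Pitch and formant perturbations would play the same role: each one destroys a single attribute of the input so that the network cannot rely on it for reconstruction.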