The variational auto-encoder (VAE) is an effective neural network architecture for disentangling a speech utterance into speaker-identity and linguistic-content latent embeddings, and then generating an utterance in a target speaker's voice from a source speaker's utterance. This is done by concatenating the identity embedding of the target speaker with the content embedding of the source speaker uttering the desired sentence. In this work, we propose to improve VAE models with self-attention and structural regularization. Specifically, we identify a suitable location in the VAE decoder at which to add a self-attention layer, which incorporates non-local information when generating a converted utterance and helps hide the source speaker's identity. We apply the relaxed group-wise splitting method (RGSM) to regularize the network weights, which remarkably enhances generalization performance. In zero-shot many-to-many voice conversion experiments on the VCTK data set, our model with the self-attention layer and RGSM improves speaker classification accuracy on unseen speakers by 28.3% while slightly improving converted voice quality in terms of MOSNet scores. These encouraging findings point to future research on integrating a wider variety of attention structures into the VAE framework, while controlling model size and overfitting, to advance zero-shot many-to-many voice conversion.
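As a rough illustration of the two mechanisms described above, the following NumPy sketch concatenates a target-speaker identity embedding with per-frame source content embeddings, then passes the result through a single-head scaled dot-product self-attention layer so that every frame can draw on non-local context. It is a toy sketch under stated assumptions, not the model from this work: the function names, the embedding sizes, the random weights, the single head, and the residual connection are all hypothetical choices made for illustration, and RGSM is not shown.

```python
import numpy as np

def build_decoder_input(content, speaker):
    """Tile the speaker embedding over time and append it to every content frame."""
    frames = content.shape[0]
    tiled = np.tile(speaker, (frames, 1))             # (T, d_spk)
    return np.concatenate([content, tiled], axis=1)   # (T, d_content + d_spk)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with a residual connection."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])            # (T, T) frame-to-frame scores
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)     # each row sums to 1
    return x + weights @ v, weights

# Hypothetical sizes: 5 frames, 8-dim content, 4-dim speaker, 6-dim queries/keys.
rng = np.random.default_rng(0)
T, d_content, d_spk, d_k = 5, 8, 4, 6
content_src = rng.normal(size=(T, d_content))   # content embedding of the source utterance
speaker_tgt = rng.normal(size=d_spk)            # identity embedding of the target speaker

z = build_decoder_input(content_src, speaker_tgt)
d = d_content + d_spk
w_q = rng.normal(size=(d, d_k))
w_k = rng.normal(size=(d, d_k))
w_v = rng.normal(size=(d, d))                   # keeps the feature size for the residual
out, attn = self_attention(z, w_q, w_k, w_v)
```

Because each row of `attn` spans all frames, every output frame is a weighted mix of the whole sequence, which is the sense in which the layer adds non-local information. Where such a layer sits in the actual decoder is a design decision of the paper and is not reproduced here.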