Disentangling content and speaking-style information is essential for zero-shot non-parallel voice conversion (VC). Our previous study investigated a novel framework with a disentangled sequential variational autoencoder (DSVAE) as the backbone for information decomposition, and demonstrated that simultaneously disentangling a content embedding and a speaker embedding from a single utterance is feasible for zero-shot VC. In this study, we continue in this direction by raising a concern about the prior distribution of the content branch in the DSVAE baseline. We find that the randomly initialized prior forces the content embedding to discard phonetic-structure information during learning, which is not a desired property. Here, we seek a better content embedding that preserves more phonetic information. We propose the conditional DSVAE, a new model that introduces a content bias as a condition on the prior and reshapes the content embedding sampled from the posterior distribution. In experiments on the VCTK dataset, we demonstrate that content embeddings derived from the conditional DSVAE overcome this randomness and achieve much better phoneme classification accuracy, stabilized vocalization, and better zero-shot VC performance than the competitive DSVAE baseline.
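The core idea above, replacing the content branch's fixed random prior with one conditioned on a content bias, can be illustrated through the KL term of the VAE objective. The sketch below is a minimal, hypothetical illustration (not the paper's implementation): it compares the KL divergence between a content-embedding posterior and (a) a fixed N(0, I) prior versus (b) a prior whose mean is assumed to come from a content-bias prior network. All numeric values are made up for illustration.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over dims."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

# Hypothetical posterior statistics for one frame's content embedding z_c.
mu_q = np.array([0.8, -0.3])
logvar_q = np.array([-1.0, -1.0])

# DSVAE baseline: content-agnostic prior N(0, I); the KL penalty pulls z_c
# toward it, discarding phonetic structure.
kl_fixed = gaussian_kl(mu_q, logvar_q, np.zeros(2), np.zeros(2))

# Conditional DSVAE (sketch): the prior mean is predicted from a content bias,
# e.g. frame-level phonetic features, so the KL no longer penalizes
# phonetically structured embeddings. The value below stands in for a
# hypothetical prior-network output.
mu_p_cond = np.array([0.7, -0.2])
kl_cond = gaussian_kl(mu_q, logvar_q, mu_p_cond, np.zeros(2))

# With a well-matched conditional prior, the KL cost of keeping phonetic
# information in z_c is much smaller than under the fixed prior.
assert kl_cond < kl_fixed
```

The design point this illustrates: the penalty term of the ELBO is what erases phonetic structure under a content-agnostic prior, so making the prior a function of a content bias changes what the model is encouraged to keep in the content embedding.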