Variational autoencoder-based voice conversion (VAE-VC) has the advantage of requiring only pairs of speech utterances and speaker labels for training. Unlike the majority of research on VAE-VC, which focuses on utilizing auxiliary losses or discretizing latent variables, this paper investigates how increasing model expressiveness benefits and otherwise affects VAE-VC. Specifically, we first analyze VAE-VC from a rate-distortion perspective and point out that model expressiveness is significant for VAE-VC because rate and distortion reflect the similarity and naturalness of converted speech. Based on this analysis, we propose a novel VC method using a deep hierarchical VAE, which combines high model expressiveness with fast conversion speed thanks to its non-autoregressive decoder. Our analysis also reveals another problem: similarity can be degraded when the latent variable of the VAE contains redundant information. We address this problem by controlling the information contained in the latent variable using the $\beta$-VAE objective. In an experiment on the VCTK corpus, the proposed method achieved mean opinion scores higher than 3.5 on both naturalness and similarity in inter-gender settings, which are higher than the scores of existing autoencoder-based VC methods.
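To make the rate-distortion reading of the $\beta$-VAE objective concrete, the following is a minimal sketch and not the paper's implementation. It assumes a single diagonal-Gaussian latent variable with a standard normal prior and a squared-error distortion; the proposed model is a deep hierarchical VAE, so its actual rate term would sum over several latent layers. All function names are hypothetical.

```python
import math


def gaussian_kl_to_standard_normal(mean, log_var):
    """Rate (assumed form): KL(N(mean, diag(exp(log_var))) || N(0, I)) in nats."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mean, log_var)
    )


def squared_error_distortion(target, reconstruction):
    """Distortion (assumed form): half the sum of squared errors, i.e. a
    unit-variance Gaussian negative log-likelihood up to an additive constant."""
    return 0.5 * sum((t - r) ** 2 for t, r in zip(target, reconstruction))


def beta_vae_loss(target, reconstruction, mean, log_var, beta=1.0):
    """Return (loss, rate, distortion) with loss = distortion + beta * rate.

    A larger beta penalizes the rate more strongly, which limits how much
    information the latent variable can carry about the input.
    """
    rate = gaussian_kl_to_standard_normal(mean, log_var)
    distortion = squared_error_distortion(target, reconstruction)
    return distortion + beta * rate, rate, distortion


if __name__ == "__main__":
    # Toy two-dimensional example with made-up numbers (illustration only).
    loss, rate, distortion = beta_vae_loss(
        target=[1.0, 2.0],
        reconstruction=[0.0, 2.0],
        mean=[1.0, 0.0],
        log_var=[0.0, 0.0],
        beta=2.0,
    )
    print(f"rate={rate:.3f} distortion={distortion:.3f} loss={loss:.3f}")
```

In this sketch, `beta` is the knob that trades rate against distortion: raising it pushes the posterior toward the prior, so less input information, including redundant information, survives in the latent variable.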