Factorizing speech into disentangled representations is vital for achieving highly controllable style transfer in voice conversion (VC). Conventional speech representation learning methods in VC factorize speech only into speaker and content, lacking controllability over other prosody-related factors. State-of-the-art speech representation learning methods for more speech factors rely on primary disentanglement algorithms such as random resampling and ad-hoc bottleneck layer size adjustment, which, however, can hardly ensure robust speech representation disentanglement. To increase the robustness of highly controllable style transfer on multiple factors in VC, we propose a disentangled speech representation learning framework based on adversarial learning. Four speech representations characterizing content, timbre, rhythm and pitch are extracted, and further disentangled by an adversarial Mask-And-Predict (MAP) network inspired by BERT. The adversarial network minimizes the correlations between the speech representations by randomly masking one representation and predicting it from the others. Experimental results show that the proposed framework significantly improves the robustness of VC on multiple factors, increasing the speech quality MOS from 2.79 to 3.30 and decreasing the MCD from 3.89 to 3.58.
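To make the Mask-And-Predict mechanism concrete, the following is a minimal PyTorch sketch of one adversarial MAP step, not the paper's actual implementation: the representation size (REP_DIM), the predictor architecture, and the use of a gradient-reversal layer (rather than alternating min-max updates) are all illustrative assumptions.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

FACTORS = ["content", "timbre", "rhythm", "pitch"]
REP_DIM = 128  # hypothetical per-factor representation size


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates gradients in the backward pass,
    so one backward step trains the predictor to predict the masked factor
    while pushing the encoders to make it unpredictable (decorrelated)."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


class MAPPredictor(nn.Module):
    """Tries to reconstruct one masked factor representation from the other three."""

    def __init__(self, rep_dim=REP_DIM, n_factors=len(FACTORS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(rep_dim * (n_factors - 1), 256),
            nn.ReLU(),
            nn.Linear(256, rep_dim),
        )

    def forward(self, others):
        return self.net(torch.cat(others, dim=-1))


def map_step(reps, predictor):
    """One Mask-And-Predict step on a dict of (batch, REP_DIM) representations.
    A random factor is masked; the predictor is trained to recover it from the
    remaining factors, and the reversed gradients drive the encoders to
    minimize the mutual predictability of the four representations."""
    masked = random.choice(FACTORS)
    target = GradReverse.apply(reps[masked])
    others = [GradReverse.apply(reps[f]) for f in FACTORS if f != masked]
    return F.mse_loss(predictor(others), target)


# Toy usage: random tensors stand in for the outputs of the four encoders.
reps = {f: torch.randn(8, REP_DIM, requires_grad=True) for f in FACTORS}
loss = map_step(reps, MAPPredictor())
loss.backward()
```

In a full training loop this adversarial loss would be combined with the framework's reconstruction objectives; the gradient-reversal trick is one common way to realize the min-max game in a single backward pass.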