Voice conversion for highly expressive speech is challenging. Current approaches struggle to balance speaker similarity, intelligibility, and expressiveness. To address this problem, we propose Expressive-VC, a novel end-to-end voice conversion framework that combines the advantages of the neural bottleneck feature (BNF) approach and the information perturbation approach. Specifically, a BNF encoder and a Perturbed-Wav encoder together form a content extractor that learns linguistic and para-linguistic features, respectively. The BNFs come from a robust pre-trained ASR model, and the perturbed waveform becomes speaker-irrelevant after signal perturbation. We then fuse the linguistic and para-linguistic features through an attention mechanism whose query is a speaker-dependent prosody feature. This feature is produced by a prosody encoder that takes the target speaker embedding and the normalized pitch and energy of the source speech as input. Finally, the decoder consumes the fused features and the speaker-dependent prosody feature to generate the converted speech. Experiments demonstrate that Expressive-VC outperforms several state-of-the-art systems, achieving both high expressiveness captured from the source speech and high similarity to the target speaker, while maintaining intelligibility.
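To make the fusion step concrete, the following is a minimal sketch of attention-based fusion in which a prosody feature acts as the query and, for each frame, the linguistic and para-linguistic features are the two candidates being weighted. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the attention form, so scaled dot-product attention without learned projections is assumed here, and the names `fuse_frame` and `fuse_sequence` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_frame(query, linguistic, paralinguistic):
    """Fuse two feature vectors for one frame.

    ASSUMPTION: scaled dot-product attention with no learned projections;
    the prosody feature is the query, and the two content features serve
    as both keys and values.
    """
    scale = math.sqrt(len(query))
    candidates = [linguistic, paralinguistic]
    weights = softmax([dot(query, c) / scale for c in candidates])
    fused = [
        sum(w * c[i] for w, c in zip(weights, candidates))
        for i in range(len(query))
    ]
    return fused, weights


def fuse_sequence(queries, linguistic_seq, paralinguistic_seq):
    """Apply fuse_frame independently to every frame of an utterance."""
    return [
        fuse_frame(q, l, p)
        for q, l, p in zip(queries, linguistic_seq, paralinguistic_seq)
    ]
```

In this sketch, a frame whose prosody query aligns more closely with the para-linguistic feature receives a fused vector dominated by that feature, and the reverse holds for the linguistic feature. A trained system would presumably learn query, key, and value projections rather than compare raw features directly.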