Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

Voice conversion generates new speech that carries the content of a source utterance in the voice style of a target speaker. In this paper, we focus on a general setting, non-parallel many-to-many voice conversion, which is close to real-world scenarios. As the name implies, non-parallel many-to-many voice conversion does not require paired source and reference speech and can be applied to arbitrary voice transfer. In recent years, Generative Adversarial Networks (GANs) and other techniques such as Conditional Variational Autoencoders (CVAEs) have made considerable progress in this field. However, because voice conversion is a complex task, the style similarity of the converted speech remains unsatisfactory. Inspired by the inherent structure of the mel-spectrogram, we propose a new voice conversion framework, the Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC). SGAN-VC converts the content of each subband of the source speech separately, explicitly exploiting the spatial characteristics that distinguish different subbands. SGAN-VC contains one style encoder, one content encoder, and one decoder. The style encoder learns style codes for the different subbands of the target speaker, the content encoder captures the content information of the source speech, and the decoder generates the content of each subband. In addition, we propose a pitch-shift module that fine-tunes the pitch of the source speaker, making the converted tone more accurate and explainable. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on the VCTK Corpus and AISHELL3 datasets, both qualitatively and quantitatively, on seen and unseen data alike. Furthermore, the content intelligibility of SGAN-VC on unseen data even exceeds that of StarGANv2-VC with ASR network assistance.
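
The subband idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the SGAN-VC implementation: the function names (`split_subbands`, `convert_subband`, `convert`), the 80-bin mel-spectrogram, the choice of four subbands, and the use of a simple (scale, shift) pair as a "style code" are all assumptions made for illustration. The learned content encoder, style encoder, and decoder are replaced here by a plain normalize-and-rescale step applied to each band.

```python
import numpy as np


def split_subbands(mel, num_subbands):
    """Split a (n_mels, n_frames) mel-spectrogram into equal-height frequency bands."""
    n_mels = mel.shape[0]
    if n_mels % num_subbands != 0:
        raise ValueError("n_mels must be divisible by num_subbands")
    return np.split(mel, num_subbands, axis=0)


def convert_subband(content_band, style_code):
    """Hypothetical stand-in for a learned decoder.

    Normalizes the band to zero mean and unit variance, then applies the
    band-specific (scale, shift) pair that plays the role of a style code.
    """
    scale, shift = style_code
    normalized = (content_band - content_band.mean()) / (content_band.std() + 1e-8)
    return normalized * scale + shift


def convert(mel, style_codes):
    """Process each subband with its own style code and stack the results."""
    bands = split_subbands(mel, len(style_codes))
    converted = [convert_subband(b, s) for b, s in zip(bands, style_codes)]
    return np.concatenate(converted, axis=0)


# Random data standing in for a source mel-spectrogram (80 mel bins, 120 frames).
rng = np.random.default_rng(0)
source_mel = rng.standard_normal((80, 120))

# One assumed (scale, shift) style code per subband, lowest frequencies first.
style_codes = [(1.0, 0.0), (0.5, -1.0), (2.0, 0.5), (1.5, 1.0)]

converted = convert(source_mel, style_codes)
print(converted.shape)
```

The point of the sketch is only the data flow: each frequency band receives its own style parameters rather than one global style vector being applied to the whole spectrogram, and the band outputs are concatenated back along the frequency axis.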

