Any-to-any voice conversion aims to convert the voice from and to any speakers, even those unseen during training, which is much more challenging than one-to-one or many-to-many tasks, but much more attractive in real-world scenarios. In this paper we propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0, while the spectral features of the utterance(s) from the target speaker are obtained from log mel-spectrograms. By aligning the hidden structures of the two different feature spaces with a two-stage training process, FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance, all based on the attention mechanism of the Transformer, as verified by analysis of the attention maps, and accomplished end-to-end. This approach is trained with a reconstruction loss only, without any disentanglement considerations between content and speaker information, and does not require parallel data. Objective evaluation based on speaker verification and subjective evaluation with MOS both showed that this approach outperforms SOTA approaches such as AdaIN-VC and AutoVC.
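To make the described fusion concrete, below is a minimal sketch of a Transformer-style cross-attention block in which source content features (e.g. Wav2Vec 2.0 hidden states) query target speaker spectral features (log mel frames), so that fine-grained target fragments are retrieved and fused into the converted utterance. This is not the authors' implementation; the module name `FragmentFuser`, the feature dimensions, and the single-layer design are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class FragmentFuser(nn.Module):
    """Hypothetical sketch of cross-attention fusion in the spirit of FragmentVC.

    Queries come from source content features (assumed Wav2Vec 2.0, dim 768);
    keys/values come from target speaker log mel frames (assumed 80 bins).
    """

    def __init__(self, content_dim=768, mel_dim=80, model_dim=512, n_heads=8):
        super().__init__()
        self.content_proj = nn.Linear(content_dim, model_dim)  # source (content) branch
        self.target_proj = nn.Linear(mel_dim, model_dim)       # target (speaker) branch
        self.cross_attn = nn.MultiheadAttention(model_dim, n_heads, batch_first=True)
        self.out_proj = nn.Linear(model_dim, mel_dim)          # predict converted mel frames

    def forward(self, content_feats, target_mels):
        # content_feats: (batch, T_src, content_dim) from the source utterance
        # target_mels:   (batch, T_tgt, mel_dim) from the target speaker utterance(s)
        q = self.content_proj(content_feats)
        kv = self.target_proj(target_mels)
        # Queries = content, keys/values = target frames; the returned weights
        # correspond to the kind of attention maps analyzed in the paper.
        fused, attn_weights = self.cross_attn(q, kv, kv)
        return self.out_proj(fused), attn_weights

# Usage sketch with a reconstruction-style objective and dummy tensors:
# during training, target frames can come from the same speaker as the source,
# so no parallel data is needed.
model = FragmentFuser()
content = torch.randn(2, 120, 768)              # stand-in Wav2Vec 2.0 features
target = torch.randn(2, 200, 80)                # stand-in target log mel frames
pred_mel, attn = model(content, target)
loss = nn.functional.l1_loss(pred_mel, torch.randn(2, 120, 80))  # dummy reconstruction target
```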