Voice conversion (VC) must preserve the content of the source speech while reproducing the characteristics of the target speaker. Existing methods do not satisfy both requirements simultaneously, so their outputs suffer from a trade-off between preserving the source content and matching the target speaker's characteristics. In this study, we propose Triple Adaptive Attention Normalization VC (TriAAN-VC), which comprises an encoder-decoder and an attention-based adaptive normalization block and can be applied to non-parallel any-to-any VC. The proposed adaptive normalization block extracts target speaker representations and performs the conversion while minimizing the loss of source content through a siamese loss. We evaluated TriAAN-VC on the VCTK dataset in terms of source content preservation and target speaker similarity. Experimental results for one-shot VC suggest that TriAAN-VC achieves state-of-the-art performance while mitigating the trade-off problem encountered in existing VC methods.