One-shot voice conversion has received significant attention because it requires only a single utterance from each of the source and target speakers, neither of whom needs to be seen during training. However, existing one-shot voice conversion approaches are not stable for unseen speakers, because a speaker embedding extracted from a single utterance of an unseen speaker is unreliable. In this paper, we propose a deep discriminative speaker encoder that extracts speaker embeddings from one utterance more effectively. Specifically, the speaker encoder first integrates a residual network with a squeeze-and-excitation network to extract discriminative frame-level speaker information by modeling frame-wise and channel-wise interdependence in the features. An attention mechanism is then introduced to further emphasize speaker-related information by assigning different weights to the frame-level speaker information. Finally, a statistics pooling layer aggregates the weighted frame-level speaker information into an utterance-level speaker embedding. Experimental results demonstrate that the proposed speaker encoder improves the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.
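The following is a minimal PyTorch sketch of the encoder pipeline described above, assuming 80-dimensional mel-spectrogram input of shape (batch, n_mels, frames); the layer widths, kernel sizes, SE reduction ratio, and the single-head attention used for frame weighting are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of an SE-ResNet speaker encoder with attentive
# statistics pooling; all hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels using global temporal context."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                              # x: (B, C, T)
        s = x.mean(dim=2)                              # squeeze over frames
        s = torch.sigmoid(self.fc2(F.relu(self.fc1(s))))
        return x * s.unsqueeze(2)                      # excite: scale channels


class SEResBlock(nn.Module):
    """1-D residual block with an SE module on the residual branch."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm1d(channels)
        self.se = SEBlock(channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.se(self.bn2(self.conv2(y)))
        return F.relu(x + y)                           # residual connection


class SpeakerEncoder(nn.Module):
    """Frame-level SE-ResNet -> attention weights -> statistics pooling."""
    def __init__(self, n_mels: int = 80, channels: int = 256,
                 n_blocks: int = 3, embed_dim: int = 256):
        super().__init__()
        self.front = nn.Conv1d(n_mels, channels, 5, padding=2)
        self.blocks = nn.Sequential(*[SEResBlock(channels)
                                      for _ in range(n_blocks)])
        self.attn = nn.Conv1d(channels, 1, 1)          # scalar score per frame
        self.proj = nn.Linear(2 * channels, embed_dim)

    def forward(self, mel):                            # mel: (B, n_mels, T)
        h = self.blocks(F.relu(self.front(mel)))       # frame-level features
        w = torch.softmax(self.attn(h), dim=2)         # attention over frames
        mu = (h * w).sum(dim=2)                        # weighted mean
        var = (h * h * w).sum(dim=2) - mu ** 2
        sigma = torch.sqrt(var.clamp(min=1e-8))        # weighted std deviation
        return self.proj(torch.cat([mu, sigma], dim=1))  # utterance embedding
```

For example, `SpeakerEncoder()(torch.randn(4, 80, 200))` returns a batch of four 256-dimensional embeddings; the weighted mean-and-standard-deviation pooling is what turns a variable-length frame sequence into a fixed-size utterance-level vector regardless of utterance duration.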