This paper presents a sequence-to-sequence voice conversion (S2S-VC) algorithm that preserves selected aspects of the source speaker during conversion, typically prosody, which is useful in many real-life applications of voice conversion. In S2S-VC, the decoder is usually conditioned on linguistic and speaker embeddings only, so only the linguistic content is actually preserved during conversion. In the proposed S2S-VC architecture, the decoder is explicitly conditioned on the desired F0 sequence, so that the converted speech has either the same F0 as the source speaker or any arbitrarily defined F0. In addition, an adversarial module is employed so that the S2S-VC model is optimized not only on the available true speech samples but can also efficiently exploit converted speech samples produced with various conditionings, such as speaker identity, F0, or timing.
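The conditioning scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_decoder_conditioning`, the toy dimensions, the use of plain concatenation, and the convention of marking unvoiced frames with 0.0 are all assumptions made here for illustration. The sketch also assumes, for simplicity, that the linguistic embeddings and the F0 sequence are already aligned frame by frame, which the abstract does not state.

```python
from typing import List, Sequence


def build_decoder_conditioning(
    linguistic: Sequence[Sequence[float]],
    speaker: Sequence[float],
    f0: Sequence[float],
) -> List[List[float]]:
    """Concatenate, for each frame, the linguistic embedding, the
    time-invariant speaker embedding, and the desired F0 value.

    Hypothetical helper: concatenation is one simple way to condition a
    decoder; the abstract does not specify the mechanism actually used.
    """
    if len(linguistic) != len(f0):
        raise ValueError("linguistic embeddings and F0 must have the same number of frames")
    return [
        list(frame) + list(speaker) + [float(pitch)]
        for frame, pitch in zip(linguistic, f0)
    ]


# Toy inputs: 3 frames, 2-dim linguistic embeddings, 2-dim target-speaker embedding.
linguistic = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
target_speaker = [1.0, -1.0]
source_f0 = [120.0, 0.0, 130.0]  # Hz; 0.0 marks an unvoiced frame (assumed convention)

# Preserve the source prosody: condition on the source speaker's F0.
cond_source_f0 = build_decoder_conditioning(linguistic, target_speaker, source_f0)

# Or condition on an arbitrary F0, here the source contour raised by one octave.
shifted_f0 = [2.0 * value for value in source_f0]
cond_shifted_f0 = build_decoder_conditioning(linguistic, target_speaker, shifted_f0)
```

In this sketch the linguistic and speaker inputs are identical in both calls and only the F0 channel changes, which mirrors the idea that F0 becomes an explicit, independently controllable input to the decoder.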