This paper introduces voice reenactment as the task of voice conversion (VC) in which the expressivity of the source speaker is preserved during conversion while the identity of a target speaker is transferred. To do so, an original neural VC architecture is proposed, based on sequence-to-sequence voice conversion (S2S-VC), in which the speech prosody of the source speaker is preserved during conversion. First, the S2S-VC architecture is modified so as to synchronize the converted speech with the source speech by means of phonetic duration encoding; second, the decoder is conditioned on the desired sequence of F0 values, and an explicit F0 loss is formulated between the F0 of the source speaker and that of the converted speech. In addition, adversarial learning of conversions is integrated within the S2S-VC architecture so as to exploit the advantages of both the reconstruction of original speech and the conversion of speech with manipulated attributes during training, thereby reducing the inconsistency between training and conversion. An experimental evaluation on the VCTK speech database shows that the speech prosody can be efficiently preserved during conversion, and that the proposed adversarial learning consistently improves the conversion and the naturalness of the reenacted speech.
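As an illustration of the explicit F0 loss mentioned above, the following is a minimal sketch and not the paper's actual formulation, which the abstract does not specify. It assumes frame-synchronous F0 contours in Hz, a value of 0 for unvoiced frames, and a mean absolute difference in the log-F0 domain; the function name `f0_loss` is hypothetical.

```python
import math


def f0_loss(f0_source, f0_converted):
    """Mean absolute log-F0 difference over frames voiced in both contours.

    Assumptions (not taken from the paper): both contours are in Hz, have
    one value per frame, are time-aligned, and use 0 for unvoiced frames.
    """
    if len(f0_source) != len(f0_converted):
        raise ValueError("F0 contours must have the same number of frames")

    # Keep only frames where both source and converted speech are voiced,
    # since log-F0 is undefined for unvoiced frames.
    diffs = [
        abs(math.log(src) - math.log(conv))
        for src, conv in zip(f0_source, f0_converted)
        if src > 0 and conv > 0
    ]

    # With no commonly voiced frame there is nothing to penalize.
    return sum(diffs) / len(diffs) if diffs else 0.0
```

Under these assumptions, a converted contour identical to the source gives a loss of zero, and a contour one octave above the source gives a loss of ln 2. Equal frame counts are plausible here only because the converted speech is synchronized with the source speech through the duration encoding.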