This paper proposes a human-in-the-loop speaker-adaptation method for multi-speaker text-to-speech. In conventional speaker adaptation, a target speaker's embedding vector is extracted from their reference speech using a speaker encoder trained on a speaker-discriminative task. However, this approach cannot obtain an embedding vector for the target speaker when reference speech is unavailable. Our method is based on a human-in-the-loop optimization framework in which a user explores the speaker-embedding space to find the target speaker's embedding. The proposed method uses a sequential line search algorithm that repeatedly asks the user to select a point on a line segment in the embedding space. To help the user efficiently choose the best speech sample from multiple stimuli, we also developed a system in which the user can switch between multiple speakers' voices at each phoneme while an utterance is played in a loop. Experimental results indicate that the proposed method achieves performance comparable to that of the conventional method in both objective and subjective evaluations, even when reference speech is not used directly as input to the speaker encoder.
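The search loop described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the human listener is replaced by a simulated oracle that picks the candidate closest to a hidden target embedding, each line segment is drawn in a random direction through the current best point (the abstract does not say how segments are chosen), and the function names, the embedding dimension, and all default parameters are hypothetical.

```python
import numpy as np


def simulated_user_choice(candidates, target):
    """Stand-in for the human listener (assumption): return the index of
    the candidate embedding closest to a hidden target embedding."""
    distances = np.linalg.norm(candidates - target, axis=1)
    return int(np.argmin(distances))


def line_search(target, dim=16, n_iters=50, n_points=11, half_length=1.0, seed=0):
    """Repeatedly ask the (simulated) user to pick a point on a line segment.

    Simplification: each segment is centered on the current best point and
    oriented in a random direction. All parameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    current = np.zeros(dim)  # assumed starting embedding: the origin
    for _ in range(n_iters):
        # Draw a random unit direction for this round's line segment.
        direction = rng.standard_normal(dim)
        direction /= np.linalg.norm(direction)
        # Discretize the segment; with an odd n_points the middle offset is 0,
        # so the user can always keep the current best point.
        offsets = np.linspace(-half_length, half_length, n_points)
        candidates = current + offsets[:, None] * direction
        # The user's selection becomes the new current best embedding.
        current = candidates[simulated_user_choice(candidates, target)]
    return current


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    hidden_target = rng.standard_normal(16)
    found = line_search(hidden_target)
    print("initial distance:", np.linalg.norm(hidden_target))
    print("final distance:  ", np.linalg.norm(found - hidden_target))
```

In a real system, each candidate on the segment would be synthesized as speech and the selection would come from a listener rather than from a distance computation; the sketch only shows how repeated one-dimensional choices can move a point through a high-dimensional embedding space.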