Substantial improvements have been achieved in recent years in voice conversion, which converts the speaker characteristics of an utterance into those of another speaker without changing its linguistic content. However, these improved conversion technologies have also raised concerns about privacy and authentication. It is therefore highly desirable to be able to prevent one's voice from being misused by such voice conversion technologies. In this paper, we report the first known attempt to perform an adversarial attack on voice conversion. We introduce human-imperceptible noise into the utterances of a speaker whose voice is to be defended. Given these adversarial examples, voice conversion models can no longer convert other utterances so that they sound as if produced by the defended speaker. Preliminary experiments were conducted on two state-of-the-art zero-shot voice conversion models. Objective and subjective evaluation results in both white-box and black-box scenarios are reported. The results show that the speaker characteristics of the converted utterances were clearly different from those of the defended speaker, while the adversarial examples of the defended speaker remained indistinguishable from the authentic utterances.
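To make the idea concrete, below is a minimal sketch (not the authors' implementation) of how such a defense could be crafted in a white-box setting: a small perturbation is optimized on the defended speaker's utterance so that, when a voice conversion model uses it as the target voice, the converted speech no longer matches the defended speaker's embedding. The model interfaces (`vc_model.speaker_encoder`, `vc_model.convert`) and the `epsilon`/`step_size` values are assumptions for illustration only.

```python
# Hypothetical sketch of an adversarial defense against voice conversion.
# Assumes a differentiable VC model exposing a speaker encoder and a convert() call.
import torch
import torch.nn.functional as F

def defend_utterance(vc_model, defended_wav, content_wav,
                     epsilon=0.005, step_size=0.001, n_steps=40):
    """Return a perturbed version of defended_wav such that converting
    content_wav toward it no longer sounds like the defended speaker."""
    # Embedding of the authentic defended utterance (the identity being protected).
    with torch.no_grad():
        target_emb = vc_model.speaker_encoder(defended_wav)

    delta = torch.zeros_like(defended_wav, requires_grad=True)
    for _ in range(n_steps):
        adv_wav = defended_wav + delta
        # Convert the attacker's content utterance using the perturbed target voice.
        converted = vc_model.convert(content=content_wav, target=adv_wav)
        conv_emb = vc_model.speaker_encoder(converted)
        # Push the converted speech away from the defended speaker:
        # minimize the cosine similarity between the two embeddings.
        loss = F.cosine_similarity(conv_emb, target_emb, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()   # signed gradient step (white-box)
            delta.clamp_(-epsilon, epsilon)          # keep the noise imperceptible
        delta.grad.zero_()
    return (defended_wav + delta).detach()
```

The L-infinity bound on `delta` is what keeps the perturbation inaudible; in a black-box scenario the same objective would have to be optimized through a substitute model or gradient-free search rather than direct backpropagation.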