To protect the privacy of speech data, speaker anonymization aims to hide a speaker's identity by changing the voice in speech recordings. This typically involves a privacy-utility trade-off between protecting individuals and keeping the data usable for downstream applications. One challenge in this context is creating non-existent voices that sound as natural as possible. In this work, we propose to tackle this issue by generating speaker embeddings with a generative adversarial network that uses the Wasserstein distance as its cost function. By incorporating these artificial embeddings into a speech-to-text-to-speech pipeline, we outperform previous approaches in terms of both privacy and utility. According to standard objective metrics and human evaluation, our approach generates intelligible, content-preserving, yet privacy-protecting versions of the original recordings.