The widespread adoption of speech-based online services raises security and privacy concerns regarding the data that they use and share. If the data were compromised, attackers could exploit user speech to bypass speaker verification systems or even impersonate users. To mitigate this, we propose DeID-VC, a speaker de-identification system that converts a real speaker's voice to that of pseudo speakers, thereby removing or obfuscating speaker-dependent attributes from spoken audio. The key components of DeID-VC include a Variational Autoencoder (VAE) based Pseudo Speaker Generator (PSG) and a voice conversion Autoencoder (AE) under zero-shot settings. With the help of the PSG, DeID-VC can assign unique pseudo speakers at the speaker level or even at the utterance level. In addition, two novel learning objectives are added to bridge the gap between training and inference of zero-shot voice conversion. We report experimental results with word error rate (WER) and equal error rate (EER), along with three subjective metrics, to evaluate the generated output of DeID-VC. The results show that our method substantially improves intelligibility (WER 10% lower) and de-identification effectiveness (EER 5% higher) compared to our baseline. Code and listening demo: https://github.com/a43992899/DeID-VC
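To make the described architecture concrete, below is a minimal PyTorch-style sketch of the two components named above: a VAE-based pseudo speaker generator that produces pseudo speaker embeddings, and a conversion autoencoder that re-synthesizes speech content conditioned on such an embedding. All module names, layer sizes, and the mel-spectrogram interface here are illustrative assumptions and are not the authors' implementation from the linked repository.

```python
# Illustrative sketch only; dimensions, architectures, and names are assumed.
import torch
import torch.nn as nn


class PseudoSpeakerGenerator(nn.Module):
    """VAE over speaker embeddings: encodes a real speaker embedding into a
    latent Gaussian and decodes latent samples into pseudo speaker embeddings."""

    def __init__(self, spk_dim=256, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(spk_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, spk_dim)
        )
        self.latent_dim = latent_dim

    def forward(self, spk_emb):
        h = self.encoder(spk_emb)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(z), mu, logvar

    @torch.no_grad()
    def sample_pseudo(self, n=1):
        # Draw pseudo speaker embeddings from the prior. Reusing one sample
        # across utterances gives speaker-level de-identification; drawing a
        # fresh sample per utterance gives utterance-level de-identification.
        z = torch.randn(n, self.latent_dim)
        return self.decoder(z)


class ConversionAutoencoder(nn.Module):
    """Zero-shot VC autoencoder: a content encoder plus a decoder conditioned
    on a (pseudo) speaker embedding, operating on mel-spectrogram frames."""

    def __init__(self, mel_dim=80, spk_dim=256, content_dim=64):
        super().__init__()
        self.content_encoder = nn.GRU(mel_dim, content_dim, batch_first=True)
        self.decoder = nn.GRU(content_dim + spk_dim, mel_dim, batch_first=True)

    def forward(self, mel, spk_emb):
        content, _ = self.content_encoder(mel)              # (B, T, content_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)
        out, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return out                                           # converted mel frames


# Usage sketch: de-identify one utterance with a freshly sampled pseudo speaker.
psg, vc = PseudoSpeakerGenerator(), ConversionAutoencoder()
mel = torch.randn(1, 120, 80)                  # placeholder mel-spectrogram
pseudo_spk = psg.sample_pseudo(n=1)            # unique pseudo speaker embedding
deidentified_mel = vc(mel, pseudo_spk)
```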