Sharing real-world speech utterances is key to the training and deployment of voice-based services. However, it also raises privacy risks, as speech contains a wealth of personal data. Speaker anonymization aims to remove speaker information from a speech utterance while leaving its linguistic and prosodic attributes intact. State-of-the-art techniques operate by disentangling the speaker information (represented via a speaker embedding) from these attributes and re-synthesizing speech based on the speaker embedding of another speaker. Prior research in the privacy community has shown that anonymization often provides only brittle privacy protection, let alone any provable guarantee. In this work, we show that disentanglement is indeed imperfect: linguistic and prosodic attributes still contain speaker information. We remove speaker information from these attributes by introducing differentially private feature extractors based on an autoencoder and an automatic speech recognizer, respectively, trained using noise layers. We plug these extractors into the state-of-the-art anonymization pipeline and generate, for the first time, private speech utterances with a provable upper bound on the speaker information they contain. We empirically evaluate the privacy and utility of our differentially private speaker anonymization approach on the LibriSpeech dataset. Experimental results show that the generated utterances retain very high utility for automatic speech recognition training and inference, while being much better protected against strong adversaries who leverage full knowledge of the anonymization process to try to infer the speaker identity.
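To make the idea of a noise layer concrete, the following is a minimal sketch of how a feature vector could be privatized before re-synthesis. It is an illustration under stated assumptions, not the method evaluated in this work: the abstract only says that the extractors are trained using noise layers, so the choice of Laplace noise, the L1 clipping step, and the names `clip_l1`, `laplace_noise_layer`, `epsilon` and `bound` are all hypothetical.

```python
import random


def clip_l1(features, bound):
    """Rescale the vector so that its L1 norm is at most `bound`."""
    norm = sum(abs(x) for x in features)
    if norm <= bound:
        return list(features)
    return [x * bound / norm for x in features]


def laplace_noise_layer(features, epsilon, bound=1.0, rng=random):
    """Hypothetical noise layer: clip, then add Laplace noise.

    Assumption (not from the source): any two clipped vectors differ by at
    most 2 * bound in L1 norm, so Laplace noise of scale 2 * bound / epsilon
    per coordinate gives epsilon-differential privacy for this single step.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    clipped = clip_l1(features, bound)
    scale = 2.0 * bound / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is a Laplace(0, scale) sample.
    return [
        x + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        for x in clipped
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    features = [3.0, -1.0, 0.5]
    print(laplace_noise_layer(features, epsilon=1.0, bound=1.0, rng=rng))
```

A smaller `epsilon` yields a larger noise scale and therefore a tighter bound on what the output reveals about the input, at the cost of utility. In a trained extractor, a layer of this kind would sit inside the network so that downstream layers learn to cope with the noise.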