Faced with the threat of identity leakage when voice data is published, users confront a privacy-utility dilemma whenever they use convenient voice services. Existing studies de-identify users' voices through direct modification or text-based re-synthesis, but these approaches yield inconsistent audibility when human listeners are involved. In this paper, we propose a voice de-identification system that uses adversarial examples to balance the privacy and utility of voice services. Instead of typical additive examples, which induce perceivable distortions, we design a novel convolutional adversarial example that modulates perturbations into real-world room impulse responses. As a result, our system can protect user identity from exposure by Automatic Speaker Identification (ASI) while preserving the perceptual quality of the voice, making the de-identification non-intrusive. Moreover, our system learns a compact speaker distribution through a conditional variational auto-encoder so that diverse target embeddings can be sampled on demand. By combining diverse target generation with input-specific perturbation construction, our system enables any-to-any identity transformation for adaptive de-identification. Experimental results show that our system achieves de-identification success rates of 98% on mainstream ASIs and 79% on commercial systems, with an objective Mel cepstral distortion of 4.31 dB and a subjective mean opinion score of 4.48.
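The difference between an additive and a convolutional adversarial example can be sketched in a few lines. This is only an illustrative sketch, not the paper's implementation: the function names, the toy impulse response, and the random stand-in signal are all assumptions, and the optimization of the perturbation against an ASI model is omitted entirely.

```python
import numpy as np


def additive_example(x: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Typical additive form: the perturbation is summed onto the waveform."""
    return x + delta


def convolutional_example(x: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolutional form: the waveform is filtered by an impulse response.

    The output is truncated to the input length so both forms are comparable.
    """
    return np.convolve(x, rir, mode="full")[: len(x)]


# Hypothetical inputs: white noise standing in for 1 s of speech at 16 kHz.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)

# Toy room impulse response (an assumption, not a measured one):
# a unit direct path followed by a weak, exponentially decaying tail.
n_taps = 800
rir = 0.05 * rng.standard_normal(n_taps) * np.exp(-np.arange(n_taps) / 100.0)
rir[0] = 1.0

y = convolutional_example(x, rir)
```

In this sketch the perturbation lives in the filter taps rather than in a signal added on top of the speech, which is the sense in which the abstract describes perturbations being modulated into room impulse responses. An impulse response consisting of a single unit tap leaves the input unchanged.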