We introduce a novel method to improve the performance of the VoicePrivacy Challenge 2022 baseline B1 variants. A known deficiency of x-vector-based anonymization systems is the insufficient disentanglement of the input features. In particular, the fundamental frequency (F0) trajectories are passed to voice synthesis without any modification. Especially in cross-gender conversion, this causes unnatural-sounding voices, increased word error rates (WERs), and leakage of personal information. Our submission overcomes this problem by synthesizing an F0 trajectory that better harmonizes with the anonymized x-vector. We use a low-complexity deep neural network to estimate an appropriate F0 value per frame from the linguistic content in the bottleneck (BN) features and the anonymized x-vector. Our approach results in a significantly improved anonymization system and increased naturalness of the synthesized voice. Consequently, our results suggest that F0 extraction is not required for voice anonymization.
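To make the per-frame estimation concrete, the following is a minimal sketch of a small feed-forward network that maps BN features and an x-vector to one F0 value per frame. It is not the architecture of the submission: the feature sizes (`BN_DIM`, `XVEC_DIM`), the hidden width, the log-F0 output with a separate voicing score, and the function names `init_params` and `predict_f0` are all assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

BN_DIM = 256    # assumed size of one bottleneck feature frame
XVEC_DIM = 512  # assumed size of the anonymized x-vector
HIDDEN = 64     # hypothetical hidden width, kept small ("low-complexity")


def init_params(rng, bn_dim=BN_DIM, xvec_dim=XVEC_DIM, hidden=HIDDEN):
    """Randomly initialize a one-hidden-layer network (untrained)."""
    in_dim = bn_dim + xvec_dim
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden)),
        "b1": np.zeros(hidden),
        # Two outputs per frame: log-F0 and a voicing logit (an assumption).
        "W2": rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 2)),
        "b2": np.zeros(2),
    }


def predict_f0(params, bn, xvec, voicing_threshold=0.5):
    """Estimate F0 in Hz for each frame; unvoiced frames are set to 0.

    bn:   array of shape (num_frames, bn_dim), the linguistic content
    xvec: array of shape (xvec_dim,), one utterance-level speaker vector
    """
    num_frames = bn.shape[0]
    # The x-vector is constant over the utterance, so repeat it per frame.
    xvec_tiled = np.broadcast_to(xvec, (num_frames, xvec.shape[0]))
    inputs = np.concatenate([bn, xvec_tiled], axis=1)

    hidden = np.maximum(0.0, inputs @ params["W1"] + params["b1"])  # ReLU
    out = hidden @ params["W2"] + params["b2"]

    f0_hz = np.exp(out[:, 0])                        # log-F0 -> Hz
    voiced_prob = 1.0 / (1.0 + np.exp(-out[:, 1]))   # sigmoid
    return np.where(voiced_prob >= voicing_threshold, f0_hz, 0.0)


rng = np.random.default_rng(0)
params = init_params(rng)
bn_frames = rng.normal(size=(100, BN_DIM))   # stand-in for real BN features
anon_xvec = rng.normal(size=XVEC_DIM)        # stand-in for an anonymized x-vector
f0_track = predict_f0(params, bn_frames, anon_xvec)
```

Because the network never sees the source speaker's F0, the predicted trajectory depends only on the linguistic content and the target pseudo-speaker, which is the property the abstract argues for.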