Self-supervised speech models learn representations that capture both content and speaker information. Yet this entanglement creates problems: content tasks suffer from speaker bias, and privacy concerns arise when speaker identity leaks through supposedly anonymized representations. We present two contributions to address these challenges. First, we develop InterpTRQE-SptME (Timbre Residual Quantitative Evaluation Benchmark of Speech pre-training Models Encoding via Interpretability), a benchmark that directly measures residual speaker information in content embeddings using SHAP-based interpretability analysis. Unlike existing indirect metrics, our approach quantifies the exact proportion of speaker information remaining after disentanglement. Second, we propose InterpTF-SptME, which uses these interpretability insights to filter speaker information from the embeddings. Testing on the VCTK corpus with seven models, including HuBERT, WavLM, and ContentVec, we find that SHAP Noise filtering reduces the speaker residual from 18.05% to nearly zero while preserving speech recognition performance (CTC loss increase under 1%). The method is model-agnostic and requires no retraining.
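To make the quantification step concrete, the sketch below is a minimal illustration of the idea, not the released implementation: concatenate a content embedding with a timbre embedding, train a speaker-classification probe on the concatenation, and use SHAP attributions to estimate what fraction of the probe's decision still rests on the content half. The synthetic embeddings, the logistic-regression probe, and the choice of Kernel SHAP are all assumptions made for brevity; the paper's actual classifier head and SHAP variant may differ.

```python
# Minimal sketch of the timbre-residual measurement idea, NOT the authors' code.
# Synthetic data, the logistic-regression probe, and Kernel SHAP are assumptions.
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_utt, d_content, d_timbre, n_speakers = 200, 32, 16, 4

# Stand-ins for real features: content embeddings from a pretrained speech model
# (e.g., HuBERT) and timbre embeddings from a speaker encoder. A little speaker
# information is deliberately leaked into the content half to simulate a residual.
speaker_ids = rng.integers(0, n_speakers, size=n_utt)
content_emb = rng.normal(size=(n_utt, d_content))
content_emb[:, :4] += 0.3 * speaker_ids[:, None]
timbre_emb = rng.normal(size=(n_utt, d_timbre)) + speaker_ids[:, None]

# Speaker-classification probe on the concatenated embedding.
X = np.concatenate([content_emb, timbre_emb], axis=1)
clf = LogisticRegression(max_iter=1000).fit(X, speaker_ids)

# Kernel SHAP attributions on a small background/evaluation split keep this cheap.
background, eval_X = X[:50], X[50:80]
explainer = shap.KernelExplainer(clf.predict_proba, background)
shap_values = explainer.shap_values(eval_X)

# Pool |SHAP| over every axis except the feature axis, then take the share of
# attribution falling on the content dimensions: the estimated timbre residual.
attr = np.abs(np.asarray(shap_values))
feat_axis = next(i for i, s in enumerate(attr.shape) if s == X.shape[1])
per_feature = attr.mean(axis=tuple(i for i in range(attr.ndim) if i != feat_axis))
residual_ratio = per_feature[:d_content].sum() / per_feature.sum()
print(f"Estimated timbre residual in the content embedding: {residual_ratio:.2%}")
```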