The widespread availability of powerful personal devices capable of collecting their users' voice has opened the opportunity to build speaker-adapted automatic speech recognition (ASR) systems or to participate in collaborative learning of ASR. In both cases, personalized acoustic models (AMs), i.e. AMs fine-tuned with speaker-specific data, can be built. A question that naturally arises is whether the dissemination of personalized acoustic models can leak personal information. In this paper, we show that it is possible to retrieve not only the gender of a speaker but also their identity by exploiting only the weight matrix changes of a neural acoustic model locally adapted to that speaker. Incidentally, we observe phenomena that may be useful towards the explainability of deep neural networks in the context of speech processing: gender can be identified almost surely using only the first layers, while speaker verification performs well when using middle-to-upper layers. Our experimental study on the TED-LIUM 3 dataset with HMM/TDNN models shows an accuracy of 95% for gender detection and an Equal Error Rate of 9.07% for a speaker verification task, by exploiting only the weights of personalized models that could be exchanged instead of user data.
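The attack surface described above can be sketched in a few lines: given the weights of a shared acoustic model and of its speaker-personalized copy, per-layer weight deltas can be summarized into a feature vector and fed to an attribute classifier (e.g. for gender or speaker identity). This is a minimal illustrative sketch with toy random matrices, not the paper's actual pipeline; all shapes, layer counts, and statistics are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_delta_features(base_layers, adapted_layers):
    """Summarize each layer's weight change as (mean, std, Frobenius norm).

    A downstream classifier trained on such vectors is what would infer
    speaker attributes from the disseminated personalized model.
    """
    feats = []
    for w0, w1 in zip(base_layers, adapted_layers):
        d = w1 - w0  # weight change introduced by speaker fine-tuning
        feats.extend([d.mean(), d.std(), np.linalg.norm(d)])
    return np.array(feats)

# Toy stand-in: a 3-layer model before and after speaker fine-tuning
# (real TDNN layers would be loaded from the shared and adapted models).
base = [rng.standard_normal((64, 64)) for _ in range(3)]
adapted = [w + 0.01 * rng.standard_normal(w.shape) for w in base]

feats = weight_delta_features(base, adapted)
print(feats.shape)  # 3 layers x 3 statistics -> (9,)
```

The paper's finding that early layers expose gender while middle-to-upper layers expose identity suggests restricting such features to a subset of layers depending on the targeted attribute.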