End-to-end Automatic Speech Recognition (ASR) models are commonly trained on spoken utterances using optimization methods like Stochastic Gradient Descent (SGD). In distributed settings such as Federated Learning, model training requires transmitting gradients over a network. In this work, we design the first method for revealing the identity of the speaker of a training utterance with access only to a gradient. We propose Hessian-Free Gradients Matching, an input reconstruction technique that operates without the second derivatives of the loss function required in prior works, which can be expensive to compute. We show the effectiveness of our method on the DeepSpeech model architecture, demonstrating that it is possible to reveal a speaker's identity with 34% top-1 accuracy (51% top-5 accuracy) on the LibriSpeech dataset. Further, we study the effect of two well-known techniques, Differentially Private SGD and Dropout, on the success of our method. We show that a dropout rate of 0.2 can reduce the speaker-identification accuracy to 0% top-1 (0.5% top-5).
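To make the gradient-matching idea concrete, below is a minimal, self-contained PyTorch sketch. It is an illustration under stated assumptions, not the paper's Hessian-Free Gradients Matching implementation: a toy linear model stands in for DeepSpeech, the labels of the victim utterance are assumed known, and a derivative-free random search is used as one possible way to minimize the gradient-matching objective. The point it demonstrates is that the matching loss can be driven down without ever computing second derivatives of the training loss, whereas backpropagating through the gradient (as in standard gradient-matching attacks) would require Hessian-vector products.

```python
# Hypothetical sketch of Hessian-free gradient matching for input
# reconstruction. All model, data, and optimizer choices below are
# illustrative assumptions, not the paper's exact HFGM procedure.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for an ASR acoustic model (the paper uses DeepSpeech).
model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 29))
loss_fn = nn.CrossEntropyLoss()

def flat_grad(x, y):
    """Gradient of the training loss w.r.t. model parameters, flattened."""
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

# "Victim" utterance features and labels; the attacker observes only the
# transmitted gradient g_true (labels assumed known for simplicity).
x_true = torch.randn(16, 40)
y_true = torch.randint(0, 29, (16,))
g_true = flat_grad(x_true, y_true).detach()

# Hessian-free matching: optimize the candidate input with a derivative-free
# random search on the gradient-matching loss, so no second derivatives of
# the training loss are ever computed. (Differentiating flat_grad w.r.t.
# x_hat would require Hessian-vector products; this avoids that.)
x_hat = torch.randn(16, 40)
best = (g_true - flat_grad(x_hat, y_true)).norm().item()
step = 0.1
for _ in range(2000):
    cand = x_hat + step * torch.randn_like(x_hat)
    err = (g_true - flat_grad(cand, y_true)).norm().item()
    if err < best:
        best, x_hat = err, cand

print(f"final gradient-matching error: {best:.4f}")
```

This naive random search scales poorly with input dimension and is only meant to show the objective and the absence of second-order computation; the paper's method reconstructs realistic speech features far more effectively, which is what enables the speaker-identification results reported above.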