Federated Learning (FL) enables training state-of-the-art Automatic Speech Recognition (ASR) models on user devices (clients) in distributed systems, thereby avoiding the transmission of raw user data to a central server. A key challenge to the practical adoption of FL for ASR is obtaining ground-truth labels on the clients. Existing approaches rely on clients to manually transcribe their speech, which is impractical for obtaining large training corpora. A promising alternative is to use semi-/self-supervised learning approaches to leverage unlabelled user data. To this end, we propose FedNST, a novel federated ASR method for noisy student training of distributed ASR models with private unlabelled user data. We explore various facets of FedNST, such as training models with different proportions of unlabelled and labelled data, and evaluate the proposed approach on 1173 simulated clients. Evaluating FedNST on LibriSpeech, where 960 hours of speech data are split equally into server (labelled) and client (unlabelled) data, showed a 22.5% relative word error rate reduction (WERR) over a supervised baseline trained only on server data.
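To make the training setup concrete, below is a minimal sketch of one FedNST-style communication round, inferred from the abstract's description: a server-side teacher (trained on the labelled server split) pseudo-labels each client's private unlabelled data on-device, clients train a noise-augmented student locally, and the server aggregates the resulting models with data-size-weighted FedAvg. All function names (`pseudo_label`, `local_train`, `fednst_round`) and the toy linear model are illustrative assumptions, not the paper's actual implementation.

```python
# A hedged sketch of one FedNST-style round, assuming:
#   - a teacher model trained on labelled server data,
#   - on-device pseudo-labelling (raw data never leaves the client),
#   - noisy student training locally, aggregated via FedAvg.
# The linear model and squared loss are stand-ins for an ASR network.
import numpy as np

rng = np.random.default_rng(0)

def pseudo_label(teacher_w, client_features):
    """Teacher inference on the client's unlabelled data, run on-device."""
    return client_features @ teacher_w

def local_train(student_w, features, targets, lr=0.1, steps=20, noise=0.05):
    """Noisy student step: perturb inputs (a stand-in for SpecAugment-style
    noise) and fit the pseudo-labels by gradient descent on a squared loss."""
    w = student_w.copy()
    for _ in range(steps):
        noisy = features + noise * rng.standard_normal(features.shape)
        grad = 2 * noisy.T @ (noisy @ w - targets) / len(targets)
        w -= lr * grad
    return w

def fednst_round(teacher_w, student_w, clients):
    """One communication round: pseudo-label, train locally, FedAvg."""
    updates, sizes = [], []
    for feats in clients:
        targets = pseudo_label(teacher_w, feats)
        updates.append(local_train(student_w, feats, targets))
        sizes.append(len(feats))
    # Data-size-weighted average of client models (FedAvg).
    return np.average(updates, axis=0, weights=np.asarray(sizes, dtype=float))

# Toy demo: 5 clients holding private unlabelled features.
dim = 8
teacher = rng.standard_normal(dim)       # stands in for server-trained teacher
student = np.zeros(dim)
clients = [rng.standard_normal((30, dim)) for _ in range(5)]
student = fednst_round(teacher, student, clients)
print("student after one round:", np.round(student, 3))
```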