Streaming end-to-end speech recognition models have been widely applied to mobile devices and show significant improvements in efficiency. These models are typically trained on a server using transcribed speech data. However, the data distribution on the server can differ substantially from that on user devices, which degrades model performance. On-device training faces two main challenges: limited reliable labels and limited training memory. While self-supervised learning algorithms can mitigate the mismatch between domains using unlabeled data, they are not directly applicable on mobile devices because of the memory constraint. In this paper, we propose an incremental layer-wise self-supervised learning algorithm for efficient speech domain adaptation on mobile devices, in which only one layer is updated at a time. Extensive experimental results demonstrate that the proposed algorithm achieves a Word Error Rate (WER) on the target domain that is $24.2\%$ better than the supervised baseline, while using $89.7\%$ less training memory than the end-to-end self-supervised learning algorithm.
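To make the layer-wise idea concrete, the following is a minimal sketch of updating one encoder layer at a time with an unlabeled-data objective. It assumes a PyTorch encoder whose layers sit in an `nn.ModuleList` and an opaque self-supervised loss `ssl_loss_fn`; the schedule, loss, and hyperparameters are illustrative placeholders, not the paper's exact procedure.

```python
# Sketch: incremental layer-wise self-supervised adaptation.
# Assumptions (not from the paper): encoder is an nn.ModuleList of
# tensor-to-tensor layers; ssl_loss_fn maps encoder output to a scalar loss.
import torch
import torch.nn as nn


def adapt_layerwise(encoder: nn.ModuleList, ssl_loss_fn, unlabeled_loader,
                    steps_per_layer: int = 100, lr: float = 1e-4):
    for k, active in enumerate(encoder):
        # Freeze all parameters, then unfreeze only the layer being adapted,
        # so the optimizer state is kept for one layer at a time.
        for p in encoder.parameters():
            p.requires_grad_(False)
        for p in active.parameters():
            p.requires_grad_(True)
        opt = torch.optim.Adam(active.parameters(), lr=lr)

        for step, feats in enumerate(unlabeled_loader):
            if step >= steps_per_layer:
                break
            # Layers below the active one need no autograd graph at all,
            # which is where the training-memory saving comes from.
            with torch.no_grad():
                h = feats
                for layer in encoder[:k]:
                    h = layer(h)
            # Layers from k upward stay in the graph so the loss can
            # backpropagate to the active layer's parameters.
            for layer in encoder[k:]:
                h = layer(h)
            loss = ssl_loss_fn(h)  # e.g., a masked-prediction objective
            opt.zero_grad()
            loss.backward()
            opt.step()
```

In this sketch the memory saving comes from discarding activations below the active layer; how the full method bounds the cost of the layers above it, and how training phases are scheduled, follow the paper's procedure rather than this simplified loop.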