Recently, self-supervised learning has emerged as an effective approach to improving the performance of automatic speech recognition (ASR). Under this framework, a neural network is usually pre-trained on massive unlabeled data and then fine-tuned on limited labeled data. However, to achieve competitive results, the network typically adopts a non-streaming architecture such as the bidirectional transformer, which cannot be used in streaming scenarios. In this paper, we focus on improving the performance of the streaming transformer under the self-supervised learning framework. Specifically, we propose a novel two-stage training method for fine-tuning that combines knowledge distillation and self-training. The proposed training method achieves a 16.3% relative word error rate (WER) reduction on the Librispeech noisy test set. Finally, using only the 100h clean subset of Librispeech as labeled data and the remaining 860h as unlabeled data, our streaming-transformer-based model obtains competitive WERs of 3.5/8.7 on the Librispeech clean/noisy test sets.
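To make the first (knowledge-distillation) stage of the fine-tuning idea concrete, the sketch below combines a supervised CTC loss on the labeled data with a frame-level distillation loss from a non-streaming teacher. This is a minimal PyTorch sketch under assumed choices; the function name, the weighting factor alpha, and the temperature are illustrative assumptions and are not specified in the abstract.

```python
import torch.nn.functional as F

def distill_finetune_loss(student_logits, teacher_logits, targets,
                          input_lengths, target_lengths,
                          temperature=2.0, alpha=0.5):
    # student_logits / teacher_logits: (T, N, C) frame-level outputs of the
    # streaming student and the non-streaming (bidirectional) teacher.
    # alpha and temperature are illustrative values, not taken from the paper.

    # Supervised CTC loss on the limited labeled data (e.g. the 100h subset).
    log_probs = F.log_softmax(student_logits, dim=-1)
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)

    # Distillation term: the streaming student matches the teacher's
    # softened frame-level output distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    return alpha * ctc + (1.0 - alpha) * kd
```

In a second (self-training) stage, one would typically transcribe the unlabeled 860h with the fine-tuned model to produce pseudo-labels and reuse them as targets in the same supervised term; the exact recipe here is an assumption rather than a detail given in the abstract.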