End-to-end models have achieved impressive results on the task of automatic speech recognition (ASR). For low-resource ASR tasks, however, the available labeled data is rarely sufficient to train end-to-end models. Self-supervised acoustic pre-training has already demonstrated strong ASR performance, but the transcriptions alone remain inadequate for language modeling in end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The sequence lengths of the two modalities are matched by a monotonic attention mechanism that introduces no additional parameters, and a fully connected layer maps hidden representations between the modalities. We further propose a scheduled fine-tuning strategy to preserve and exploit the text-context modeling ability of the pre-trained linguistic encoder. Experiments show that the pre-trained modules are utilized effectively: our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
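To make the fusion concrete, the following is a minimal sketch of how such an architecture could be wired together in PyTorch with the HuggingFace transformers library. The checkpoint names and the `monotonic_pool` segment-averaging are illustrative assumptions standing in for the paper's exact parameter-free monotonic attention, not a reproduction of it; the fully connected `proj` layer plays the role of the hidden mapping between modalities.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel

class FusedASR(nn.Module):
    """Illustrative fusion of a pre-trained acoustic encoder (wav2vec2.0)
    and a pre-trained linguistic encoder (BERT). The length matching below
    is a simplified, parameter-free stand-in for the paper's monotonic
    attention mechanism (hypothetical detail)."""

    def __init__(self, vocab_size: int):
        super().__init__()
        # checkpoint names are assumptions for illustration
        self.acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.linguistic = BertModel.from_pretrained("bert-base-uncased")
        # fully connected layer: map acoustic hidden size to BERT hidden size
        self.proj = nn.Linear(self.acoustic.config.hidden_size,
                              self.linguistic.config.hidden_size)
        self.out = nn.Linear(self.linguistic.config.hidden_size, vocab_size)

    @staticmethod
    def monotonic_pool(frames: torch.Tensor, target_len: int) -> torch.Tensor:
        # Parameter-free monotonic length matching: average contiguous
        # frame segments so T acoustic frames shrink to target_len steps.
        b, t, d = frames.shape
        bounds = torch.linspace(0, t, target_len + 1).long()
        pooled = [frames[:, bounds[i]:max(bounds[i + 1], bounds[i] + 1)].mean(dim=1)
                  for i in range(target_len)]
        return torch.stack(pooled, dim=1)  # (b, target_len, d)

    def forward(self, waveform: torch.Tensor, target_len: int) -> torch.Tensor:
        h = self.acoustic(waveform).last_hidden_state   # (b, T, d_acoustic)
        h = self.monotonic_pool(h, target_len)          # (b, U, d_acoustic)
        h = self.proj(h)                                # (b, U, d_bert)
        # feed the acoustic-derived embeddings through BERT's encoder stack
        h = self.linguistic(inputs_embeds=h).last_hidden_state
        return self.out(h)                              # (b, U, vocab_size)
```

In this reading, fine-tuning only needs to adapt the thin `proj` layer and the two encoders jointly on the limited labeled data; a scheduled fine-tuning strategy, as the abstract proposes, would additionally control when the BERT parameters are unfrozen so its text-context modeling is not overwritten early in training.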