In this paper, we investigate domain adaptation for low-resource Automatic Speech Recognition (ASR) of target-domain data, when a well-trained ASR model, trained on a large source-domain dataset, is available. We argue that in the encoder-decoder framework, the decoder of the well-trained ASR model is largely tuned towards the source domain, hurting the performance of target-domain models under vanilla transfer learning. The encoder layers of the well-trained ASR model, on the other hand, mostly capture acoustic characteristics. We therefore propose to use the embeddings tapped from these encoder layers as features for a downstream Conformer target-domain model, and show that they provide significant improvements. We present ablation studies on which encoder layer is optimal for tapping embeddings, as well as on the effect of freezing versus updating the well-trained ASR model's encoder layers. We further show that applying Spectral Augmentation (SpecAug) to the proposed features, in addition to the default SpecAug on the input spectral features, yields a further improvement in target-domain performance. With LibriSpeech-100-clean as the target domain and a model well-trained on SPGI-5000, we obtain a 30% relative improvement over the baseline. Similarly, with WSJ as the target domain and a model well-trained on LibriSpeech-960, we obtain a 50% relative improvement over the baseline.
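A minimal sketch of the proposed pipeline follows, not the authors' released code: embeddings are tapped from an intermediate layer of a frozen, well-trained encoder, SpecAug-style masking is applied to those embeddings, and the result is fed to a downstream Conformer. The pretrained encoder here is a stand-in (a stack of `TransformerEncoderLayer`s rather than a real source-domain ASR encoder), and the layer counts, dimensions, tap index, and mask widths are illustrative assumptions, not values from the paper.

```python
# Sketch only: stand-in modules and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torchaudio.models import Conformer


class PretrainedEncoderTap(nn.Module):
    """Stand-in for the well-trained source-domain ASR encoder."""

    def __init__(self, d_model=256, n_layers=12, tap_layer=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.tap_layer = tap_layer  # which layer to tap (ablated in the paper)

    def forward(self, x):
        # Run only up to the tap layer; deeper (more decoder-coupled)
        # layers are discarded.
        for layer in self.layers[: self.tap_layer]:
            x = layer(x)
        return x  # embeddings tapped from the tap_layer-th encoder layer


def spec_augment(feats, n_freq_masks=2, freq_width=20, n_time_masks=2, time_width=30):
    """SpecAug-style frequency/time masking, applied here to tapped embeddings."""
    feats = feats.clone()
    _, T, F = feats.shape
    for _ in range(n_freq_masks):
        f0 = torch.randint(0, max(F - freq_width, 1), (1,)).item()
        feats[:, :, f0 : f0 + freq_width] = 0.0
    for _ in range(n_time_masks):
        t0 = torch.randint(0, max(T - time_width, 1), (1,)).item()
        feats[:, t0 : t0 + time_width, :] = 0.0
    return feats


# Freeze the well-trained encoder (the paper also ablates updating it).
tap = PretrainedEncoderTap().eval().requires_grad_(False)
downstream = Conformer(input_dim=256, num_heads=4, ffn_dim=1024,
                       num_layers=4, depthwise_conv_kernel_size=31)

spectral = torch.randn(2, 200, 256)   # input spectral features (default SpecAug upstream)
lengths = torch.tensor([200, 180])
with torch.no_grad():
    emb = tap(spectral)
emb = spec_augment(emb)               # the proposed second SpecAug, on the embeddings
out, out_lengths = downstream(emb, lengths)
```

In a full system the tapped embeddings would feed the target-domain model's training loss (e.g. CTC/attention); the sketch stops at the downstream Conformer's encoder output to keep the feature-extraction idea isolated.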