Self-supervised learning (SSL) in the pretraining stage using un-annotated speech data has been successful in low-resource automatic speech recognition (ASR) tasks. However, models trained through SSL are biased toward the pretraining data, which is usually different from the data used in finetuning tasks, causing a domain shifting problem and thus limiting knowledge transfer. We propose a novel framework, domain responsible adaptation and finetuning (DRAFT), to reduce domain shifting in pretrained speech models through an additional adaptation stage. In DRAFT, residual adapters (RAs) are inserted into the pretrained model to learn domain-related information with the same SSL loss as the pretraining stage. Only RA parameters are updated during the adaptation stage. DRAFT is agnostic to the type of SSL method used and is evaluated with three widely used approaches: APC, Wav2vec2.0, and HuBERT. On two child ASR tasks (OGI and MyST databases), using SSL models trained with un-annotated adult speech data (Librispeech), relative WER improvements of up to 19.7% are observed when compared to the pretrained models without adaptation. Additional experiments examine the potential of cross knowledge transfer between the two datasets, and the promising results suggest broader applicability of the proposed DRAFT framework.
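The abstract does not give the adapter architecture or training details, but the core idea (insert small residual adapter modules into a frozen pretrained encoder and update only them during the adaptation stage) can be illustrated with a minimal sketch. The bottleneck design, dimensions, and the name-based freezing rule below are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class ResidualAdapter(nn.Module):
    """Bottleneck residual adapter: down-project, nonlinearity, up-project,
    then add the input back through a residual connection."""

    def __init__(self, dim: int, bottleneck_dim: int = 256):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


def freeze_except_adapters(model: nn.Module) -> None:
    """Adaptation-stage setup: freeze all pretrained weights and leave only
    the adapter parameters trainable (assumes adapter modules carry
    'adapter' in their parameter names)."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
```

During adaptation, the frozen model with adapters would be trained on the target-domain (e.g., child speech) data using the same SSL objective as pretraining; in the subsequent finetuning stage, the full model can then be finetuned on the labeled ASR task.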