Self-supervised pre-training can effectively improve the performance of low-resource automatic speech recognition (ASR). However, existing self-supervised pre-training approaches are task-agnostic, i.e., they can be applied to various downstream tasks. Although this enlarges the scope of application, the capacity of the pre-trained model is not fully exploited for the ASR task, and the learned representations may not be optimal for ASR. In this work, to build a better pre-trained model for low-resource ASR, we propose a pre-training approach called wav2vec-S, which uses task-specific semi-supervised pre-training to refine the self-supervised pre-trained model for the ASR task, thereby more effectively utilizing the capacity of the pre-trained model to generate task-specific representations for ASR. Experiments show that, compared to wav2vec 2.0, wav2vec-S requires only a marginal increase in pre-training time but significantly improves ASR performance on in-domain, cross-domain and cross-lingual datasets: average relative WER reductions are 24.5% and 6.6% for 1h and 10h fine-tuning, respectively. Furthermore, through canonical correlation analysis, we show that semi-supervised pre-training narrows the representation gap between the self-supervised pre-trained model and the corresponding fine-tuned model.
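The representation-gap analysis above relies on canonical correlation analysis (CCA) between layer activations of two models on the same utterances. The snippet below is a minimal sketch of a linear CCA-based similarity score, not the paper's exact implementation; the array names and toy data are hypothetical and stand in for frame-level layer representations of the pre-trained and fine-tuned models.

```python
import numpy as np

def cca_similarity(X, Y):
    """Mean canonical correlation between two representation matrices.

    X, Y: arrays of shape (n_frames, dim), e.g. layer activations of the
    pre-trained and the fine-tuned model on the same frames (hypothetical data).
    A value close to 1.0 indicates a small representation gap.
    """
    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Whiten each view via its thin SVD (the left singular vectors are an
    # orthonormal basis of the column space, i.e. a whitened version of the data).
    Ux, _, _ = np.linalg.svd(X, full_matrices=False)
    Uy, _, _ = np.linalg.svd(Y, full_matrices=False)
    # Canonical correlations are the singular values of Ux^T Uy.
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return rho.mean()

# Toy usage: random features standing in for layer representations.
rng = np.random.default_rng(0)
pretrained = rng.standard_normal((1000, 768))
finetuned = 0.8 * pretrained + 0.2 * rng.standard_normal((1000, 768))
print(cca_similarity(pretrained, finetuned))
```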