Recent research in speech processing exhibits a growing interest in unsupervised and self-supervised representation learning from unlabelled data to alleviate the need for large amounts of annotated data. We investigate several popular pre-training methods and apply them to Flemish Dutch. We compare off-the-shelf English pre-trained models to models trained on an increasing amount of Flemish data. We find that the most important factors for positive transfer to downstream speech recognition tasks are a substantial amount of data and a matching pre-training domain. Ideally, we also finetune on an annotated subset in the target language. All pre-trained models improve linear phone separability in Flemish, but not all methods improve Automatic Speech Recognition. We observe the best performance with wav2vec 2.0, and we obtain a 30% WER improvement by finetuning the multilingually pre-trained XLSR-53 model on Flemish Dutch, after integration into an HMM-DNN acoustic model.
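As an illustration of the finetuning step described above, the sketch below shows how a multilingual XLSR-53 checkpoint can be adapted to an annotated Flemish subset. It uses the HuggingFace Transformers API with a CTC head, which is an assumption for illustration only; the paper integrates the finetuned model into an HMM-DNN acoustic model rather than decoding with CTC. The vocabulary file `vocab_flemish.json` and the example transcript are hypothetical placeholders.

```python
# Minimal finetuning sketch (assumed setup, not the authors' exact pipeline):
# adapt the multilingual XLSR-53 encoder to Flemish Dutch with a CTC head.
import torch
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

# Character-level vocabulary for Flemish Dutch transcripts (hypothetical file).
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab_flemish.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the multilingually pre-trained XLSR-53 encoder and attach a fresh CTC head
# sized to the Flemish character vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # keep the convolutional front-end fixed during finetuning

# One illustrative training step on a single (waveform, transcript) pair.
waveform = torch.zeros(16000)  # placeholder: 1 second of 16 kHz audio
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("een voorbeeld", return_tensors="pt").input_ids
loss = model(inputs.input_values, labels=labels).loss
loss.backward()  # in practice, wrap this in a full training loop with an optimizer
```

In practice, the finetuned encoder's representations (or its outputs) would then feed the HMM-DNN acoustic model mentioned in the abstract; the CTC objective here only serves to adapt the encoder to the target language.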