In recent years, neural models learned through self-supervised pretraining on large-scale multilingual text or speech data have exhibited promising results for under-resourced languages, especially when a relatively large amount of data from related language(s) is available. While this technology has the potential to facilitate tasks carried out in language documentation projects, such as speech transcription, pretraining a multilingual model from scratch for every new language would be highly impractical. We investigate the possibility of adapting an existing multilingual wav2vec 2.0 model for a new language, focusing on actual fieldwork data from a critically endangered language: Ainu. Specifically, we (i) examine the feasibility of leveraging data from similar languages in fine-tuning as well; and (ii) verify whether the model's performance can be improved by further pretraining on target-language data. Our results show that continued pretraining is the most effective method of adapting a wav2vec 2.0 model to a new language and leads to considerable reductions in error rates. Furthermore, we find that if a model pretrained on a related speech variety, or on an unrelated language with similar phonological characteristics, is available, multilingual fine-tuning using additional data from that language can have a positive impact on speech recognition performance when there is very little labeled data in the target language.