ASR systems designed for native (L1) English usually underperform on non-native (L2) English. To address this performance gap, \textbf{(i)} we extend our previous work to investigate fine-tuning of a pre-trained wav2vec 2.0 model \cite{baevski2020wav2vec,xu2021self} under a rich set of L1 and L2 training conditions. We further \textbf{(ii)} incorporate language model decoding into the ASR system, alongside the fine-tuning method. Quantifying the gains from each of these two approaches separately, together with an error analysis, allows us to identify different sources of improvement within our models. We find that while the large self-trained wav2vec 2.0 may be internalizing sufficient decoding knowledge for clean L1 speech \cite{xu2021self}, this does not hold for L2 speech, which accounts for the utility of employing language model decoding on L2 data.