In this paper, we propose a language-universal adapter learning framework based on a pre-trained model for end-to-end multilingual automatic speech recognition (ASR). For acoustic modeling, the wav2vec 2.0 pre-trained model is fine-tuned by inserting language-specific and language-universal adapters. Online knowledge distillation is then used to enable the language-universal adapters to learn both language-specific and universal features. Confusion of linguistic information is also reduced by leveraging language identifiers (LIDs): with LIDs, we perform a position-wise modification of the multi-head attention outputs. During inference, the language-specific adapters are removed while the language-universal adapters remain active. The proposed method improves recognition accuracy and avoids the linear growth in the number of adapter parameters with the number of languages that is typical of common multilingual ASR systems. Experiments on the BABEL dataset confirm the effectiveness of the proposed framework. Compared to the conventional multilingual model, a 3.3% absolute error rate reduction is achieved. The code is available at: https://github.com/shen9712/UniversalAdapterLearning.
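As a rough illustration only (not the authors' implementation; the function names, shapes, and the exact form of the LID modification are assumptions), a bottleneck adapter inserted into a Transformer layer, with an optional position-wise LID-conditioned shift on the attention output, might be sketched as:

```python
import numpy as np

def adapter(x, W_down, W_up, lid_bias=None):
    """Bottleneck adapter: down-project, ReLU, up-project, residual add.

    x:        (frames, d_model) hidden features from the pre-trained encoder
    W_down:   (d_model, d_bottleneck) down-projection weights
    W_up:     (d_bottleneck, d_model) up-projection weights
    lid_bias: optional (d_model,) language-identifier (LID) vector, a
              hypothetical stand-in for the paper's position-wise
              modification of the multi-head attention outputs
    """
    h = np.maximum(x @ W_down, 0.0)  # bottleneck with ReLU nonlinearity
    out = x + h @ W_up               # residual connection around the adapter
    if lid_bias is not None:
        out = out + lid_bias         # same LID shift applied at every position
    return out

# Toy dimensions: model dim 8, bottleneck dim 2, 4 feature frames
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W_down = rng.standard_normal((8, 2)) * 0.1
W_up = rng.standard_normal((2, 8)) * 0.1
lid_bias = rng.standard_normal(8) * 0.1

y = adapter(x, W_down, W_up, lid_bias)
print(y.shape)  # output keeps the input shape: (4, 8)
```

In this sketch, a language-specific adapter and a language-universal adapter would each be a separate (`W_down`, `W_up`) pair inserted into every encoder layer; at inference only the universal pair would be kept, so the parameter count no longer grows with the number of languages.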