In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can \textbf{re-purpose} well-trained English automatic speech recognition (ASR) models to recognize other languages. We design different auxiliary neural architectures focused on learnable pre-trained feature enhancement that, for the first time, empower model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., the encoder) of a conformer-based RNN-Transducer used as a frozen pre-trained backbone. Experiments on a seven-language Multilingual LibriSpeech (MLS) task show that model reprogramming requires only 4.2% (11M out of 270M) to 6.8% (45M out of 660M) of the trainable parameters of a full ASR model to achieve competitive results, with WERs ranging from 11.9% to 8.1% averaged across languages. In addition, we identify different setups that make a large-scale pre-trained ASR model succeed in both monolingual and multilingual speech recognition. Our methods outperform existing ASR tuning architectures and their extensions with self-supervised losses (e.g., w2v-bert) in terms of lower WER and better training efficiency.
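To make the reprogramming idea concrete, below is a minimal, illustrative sketch of the general recipe described above: a large pre-trained encoder is kept frozen while a small auxiliary module that transforms the input features is the only part being trained. The class name \texttt{InputReprogrammer}, the GRU stand-in for the conformer-based RNN-T encoder, and all dimensions are assumptions for illustration, not the paper's actual implementation.

\begin{verbatim}
# Sketch only: freeze a pre-trained backbone and train a small
# feature-reprogramming module (placeholder names and sizes).
import torch
import torch.nn as nn

class InputReprogrammer(nn.Module):
    """Small trainable network adapting input features for a frozen encoder."""
    def __init__(self, feat_dim: int = 80, hidden_dim: int = 256):
        super().__init__()
        # Lightweight convolutional adapter over the feature dimension.
        self.adapter = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, feat_dim, kernel_size=3, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim); learn an additive perturbation.
        x = feats.transpose(1, 2)                 # (batch, feat_dim, time)
        delta = self.adapter(x).transpose(1, 2)
        return feats + delta                      # reprogrammed features

def freeze(module: nn.Module) -> nn.Module:
    """Freeze all parameters of a pre-trained backbone."""
    for p in module.parameters():
        p.requires_grad_(False)
    return module

# Hypothetical usage: a GRU stands in for the frozen English ASR encoder;
# only the reprogrammer's parameters are optimized.
pretrained_encoder = freeze(
    nn.GRU(input_size=80, hidden_size=512, batch_first=True))
reprogrammer = InputReprogrammer(feat_dim=80)

feats = torch.randn(2, 100, 80)                   # dummy log-mel features
enc_out, _ = pretrained_encoder(reprogrammer(feats))

trainable = sum(p.numel() for p in reprogrammer.parameters()
                if p.requires_grad)
total = trainable + sum(p.numel() for p in pretrained_encoder.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
\end{verbatim}

Only the reprogrammer's parameters receive gradients, which is what keeps the trainable-parameter budget at a small fraction of the full model, in the spirit of the 4.2--6.8\% figures reported above.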