While large language models (LLMs) have made impressive progress in natural language processing, it remains unclear how to utilize them to improve automatic speech recognition (ASR). In this work, we propose to train a single multilingual language model (LM) for shallow fusion in multiple languages. We push the limits of the multilingual LM to cover up to 84 languages by scaling up with a mixture-of-experts LLM, i.e., the Generalist Language Model (GLaM). As the number of experts increases, GLaM dynamically selects only two experts at each decoding step, keeping the inference computation roughly constant. We then apply GLaM to a multilingual shallow fusion task on top of a state-of-the-art end-to-end ASR model. Compared to a dense LM with similar inference computation, GLaM reduces the WER of an English long-tail test set by 4.4% relative. In a multilingual shallow fusion task, GLaM improves 41 out of 50 languages, with an average relative WER reduction of 3.85% and a maximum reduction of 10%. Compared to the baseline model, GLaM achieves an average WER reduction of 5.53% over 43 languages.
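For readers unfamiliar with the two mechanisms named above, the following minimal sketch (NumPy, with hypothetical toy numbers; not the paper's implementation) illustrates the standard shallow-fusion score interpolation and a GLaM-style top-2 expert gating step, where only two experts are activated per token so that per-token compute stays roughly constant as the expert count grows.

```python
import numpy as np

def shallow_fusion_scores(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine per-token ASR and LM log-probabilities by log-linear
    interpolation, the usual shallow-fusion objective:
        score(y) = log P_ASR(y|x) + lambda * log P_LM(y)."""
    return asr_log_probs + lm_weight * lm_log_probs

def top2_gating(router_logits):
    """Select the two highest-scoring experts and renormalize their gate
    weights, so per-token computation is roughly constant regardless of
    the total number of experts (the GLaM-style routing described above)."""
    top2 = np.argsort(router_logits)[-2:][::-1]  # indices of the two best experts
    gates = np.exp(router_logits[top2])
    gates /= gates.sum()                         # normalize over the chosen pair
    return top2, gates

# Toy example with made-up scores (not from the paper).
asr_lp = np.log(np.array([0.6, 0.3, 0.1]))  # ASR posterior over 3 candidate tokens
lm_lp  = np.log(np.array([0.2, 0.7, 0.1]))  # LM prior over the same tokens
fused = shallow_fusion_scores(asr_lp, lm_lp, lm_weight=0.3)
print("fused scores:", fused, "-> pick token", int(np.argmax(fused)))

experts, weights = top2_gating(np.array([0.1, 2.0, -0.5, 1.2]))
print("selected experts:", experts, "gate weights:", weights)
```

The LM weight (here 0.3) is a tunable interpolation hyperparameter; the routing function above is only a schematic stand-in for the learned router in a mixture-of-experts layer.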