In this paper, we investigate how the output representation of an end-to-end neural network affects multilingual automatic speech recognition (ASR). We study different representations, including character-level, byte-level, byte pair encoding (BPE), and byte-level byte pair encoding (BBPE) representations, and analyze their strengths and weaknesses. We focus on developing a single end-to-end model to support utterance-based bilingual ASR, where speakers do not alternate between two languages within a single utterance but may change languages across utterances. We conduct our experiments on English and Mandarin dictation tasks, and we find that BBPE with penalty schemes can improve utterance-based bilingual ASR performance by 2% to 5% relative, even with a smaller number of outputs and fewer parameters. We conclude with an analysis that indicates directions for further improving multilingual ASR.
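The contrast between the character-level and byte-level output representations studied here can be illustrated with a minimal sketch (the example string is hypothetical, not drawn from the paper's data): a byte-level model emits UTF-8 bytes, so its output vocabulary is capped at 256 symbols regardless of how many scripts the task covers, while each Mandarin character expands to three bytes.

```python
# Minimal sketch (not the paper's implementation): character-level vs
# byte-level output symbols for a mixed English/Mandarin utterance.

text = "play 周杰伦 music"  # hypothetical bilingual utterance

# Character-level: one output symbol per Unicode character; the
# vocabulary must cover every character in every target language.
chars = list(text)

# Byte-level: one output symbol per UTF-8 byte; the vocabulary is at
# most 256 symbols, but Mandarin characters cost 3 symbols each.
byte_ids = list(text.encode("utf-8"))

print(len(chars))     # 14 symbols (11 ASCII + 3 CJK characters)
print(len(byte_ids))  # 20 symbols (11 ASCII bytes + 3 x 3 CJK bytes)
```

BPE and BBPE then merge frequent character or byte sequences into larger subword units, trading a bigger vocabulary for shorter output sequences.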