Word-piece models (WPMs) are commonly used subword units in state-of-the-art end-to-end automatic speech recognition (ASR) systems. In multilingual ASR, because written scripts differ across languages, multilingual WPMs lead to overly large output layers and make it difficult to scale to more languages. In this work, we propose a universal monolingual output layer (UML) to address these problems. Instead of dedicating each output node to a single WPM, UML re-associates each output node with multiple WPMs, one for each language, which results in a smaller, monolingual-sized output layer shared across languages. Consequently, UML allows the interpretation of each output node to be switched according to the language of the input speech. Experimental results on an 11-language voice search task demonstrate the feasibility of using UML for high-quality and high-efficiency multilingual streaming ASR.
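The node re-association described in the abstract can be illustrated with a minimal sketch. It assumes toy three-language vocabularies and a simple position-based assignment of word pieces to output nodes. The names `VOCABS`, `UML_SIZE`, and `decode` are hypothetical, and the abstract does not specify how the actual system assigns WPMs to nodes or obtains the language identity.

```python
from typing import Dict, List

# Hypothetical toy vocabularies (assumption): one word-piece list per language.
# Position i in each list is the word piece that output node i stands for
# when the input speech is in that language.
VOCABS: Dict[str, List[str]] = {
    "en": ["<blank>", "_the", "_cat", "s"],
    "fr": ["<blank>", "_le", "_chat", "s"],
    "ru": ["<blank>", "_кот", "_и", "ы"],
}

# A conventional multilingual output layer needs one node per distinct piece.
UNION_SIZE = len(set().union(*VOCABS.values()))

# A shared layer in the spirit of UML only needs as many nodes as the
# largest single-language vocabulary.
UML_SIZE = max(len(pieces) for pieces in VOCABS.values())


def decode(node_ids: List[int], lang: str) -> List[str]:
    """Interpret shared output-node indices using the given language's table."""
    pieces = VOCABS[lang]
    for i in node_ids:
        if not 0 <= i < len(pieces):
            raise ValueError(f"node {i} has no word piece for language {lang!r}")
    return [pieces[i] for i in node_ids]


if __name__ == "__main__":
    nodes = [1, 2]  # the same output nodes fire in both cases
    print(UNION_SIZE, UML_SIZE)
    print(decode(nodes, "en"))
    print(decode(nodes, "fr"))
```

In this sketch the same node indices yield different word pieces depending on the language tag, while the output layer stays at the size of a single-language vocabulary rather than the union of all vocabularies.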