On-device end-to-end (E2E) models have shown improvements over conventional models on English Voice Search tasks in both quality and latency. E2E models have also shown promising results for multilingual automatic speech recognition (ASR). In this paper, we extend our previous capacity solution to streaming applications and present a streaming multilingual E2E ASR system that runs fully on device with quality and latency comparable to individual monolingual models. To achieve this, we propose an Encoder Endpointer model and an End-of-Utterance (EOU) Joint Layer for a better quality-latency trade-off. Our system is built in a language-agnostic manner, allowing it to natively support intersentential code switching in real time. To address feasibility concerns about large models, we conducted on-device profiling and replaced the time-consuming LSTM decoder with the recently developed Embedding decoder. With these changes, we managed to run such a system on a mobile device faster than real time.
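As a rough illustration of why an Embedding decoder is attractive on device, the sketch below replaces an LSTM prediction network with embedding lookups over only the last few predicted labels, so each decoding step costs a couple of table lookups instead of a recurrent update. The vocabulary size, output dimension, context length, and function names are illustrative assumptions, not the values or code used in our system.

```python
import numpy as np

# Minimal sketch of an embedding-based prediction network (assumed sizes,
# not the paper's exact configuration): instead of running an LSTM over
# the full label history, condition only on the last CONTEXT labels via
# per-position embedding tables.

rng = np.random.default_rng(0)

VOCAB = 4096    # wordpiece vocabulary size (assumption)
DIM = 640       # prediction-network output dimension (assumption)
CONTEXT = 2     # number of previous labels conditioned on (assumption)

# One embedding table per history position.
embeddings = rng.standard_normal((CONTEXT, VOCAB, DIM)) * 0.02

def embedding_decoder(prev_labels):
    """Return the prediction-network output for the most recent labels.

    prev_labels: list of int label ids, most recent last. There is no
    recurrent state, so the per-step cost is a few table lookups and adds,
    which is what makes this style of decoder cheap for on-device streaming.
    """
    out = np.zeros(DIM)
    history = prev_labels[-CONTEXT:]
    for pos, label in enumerate(reversed(history)):
        out += embeddings[pos, label]
    return out

# Usage: the joint network would consume this vector together with the
# encoder output at the current frame.
dec_out = embedding_decoder([17, 205, 3071])
print(dec_out.shape)  # (640,)
```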