We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.
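The abstract does not spell out the quantization scheme, so as a rough illustration only: a uniform 8-bit quantization of a float weight matrix, which cuts storage 4x relative to float32. The function names and the simple min/max scaling are my own assumptions, not the paper's method.

```python
import numpy as np

def quantize_uint8(W):
    """Uniformly quantize a float weight matrix to 8 bits.

    Weights are mapped linearly onto [0, 255]; the scale and offset are
    kept in float so the matrix can be dequantized at inference time.
    (Illustrative sketch; not the scheme used in the paper.)
    """
    lo, hi = float(W.min()), float(W.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.round((W - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Recover an approximate float matrix from its 8-bit codes."""
    return q.astype(np.float32) * scale + lo
```

The maximum reconstruction error of this scheme is half a quantization step, i.e. `scale / 2`.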
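The SVD-based compression mentioned above can be sketched as a low-rank factorization of a weight matrix: an m x n matrix is replaced by two factors of total size r(m + n), which is much smaller when the retained rank r is small. This is a minimal sketch of the general technique; the ranks, which layers are compressed, and any fine-tuning after compression are details the abstract does not give.

```python
import numpy as np

def svd_compress(W, rank):
    """Factor W (m x n) into U_r (m x rank) and V_r (rank x n).

    The product U_r @ V_r is the best rank-`rank` approximation of W
    in the Frobenius norm; storage drops from m*n to rank*(m + n).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

# Usage: compress a hypothetical 640 x 2048 LSTM weight matrix to rank 64,
# reducing parameters from ~1.3M to ~172k.
W = np.random.default_rng(0).standard_normal((640, 2048))
U_r, V_r = svd_compress(W, rank=64)
```

In practice such compression is typically followed by retraining so the network can recover accuracy lost to the rank truncation.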