Recently, the speech community has seen a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve state-of-the-art ASR accuracy on most benchmarks, hybrid models are still used in a large proportion of commercial ASR systems today. Many practical factors affect the decision to deploy a model in production. Traditional hybrid models, having been optimized for production for decades, usually handle these factors well. Unless E2E models offer excellent solutions to all of these factors, it is hard for them to be widely commercialized. In this paper, we give an overview of recent advances in E2E models, focusing on technologies that address those challenges from an industry perspective.