In the last decade of automatic speech recognition (ASR) research, the introduction of deep learning brought considerable reductions in word error rate of more than 50% relative, compared to modeling without deep learning. In the wake of this transition, a number of all-neural ASR architectures were introduced. These so-called end-to-end (E2E) models provide highly integrated, completely neural ASR systems that rely strongly on general machine learning knowledge, learn more consistently from data, and depend less on ASR domain-specific experience. The success and enthusiastic adoption of deep learning, together with more generic model architectures, have led to E2E models now becoming the prominent ASR approach. The goal of this survey is to provide a taxonomy of E2E ASR models and corresponding improvements, and to discuss their properties and their relation to the classical hidden Markov model (HMM) based ASR architecture. All relevant aspects of E2E ASR are covered in this work: modeling, training, decoding, and external language model integration, accompanied by discussions of performance and deployment opportunities, as well as an outlook into potential future developments.