In this paper, we review various end-to-end automatic speech recognition algorithms and their optimization techniques for on-device applications. Conventional speech recognition systems comprise a large number of discrete components, such as an acoustic model, a language model, a pronunciation model, a text normalizer, an inverse text normalizer, and a decoder based on a Weighted Finite-State Transducer (WFST). To obtain sufficiently high recognition accuracy with such conventional systems, a very large language model (up to 100 GB) is usually needed. The corresponding WFST therefore becomes enormous, which precludes on-device implementation. Recently, fully neural end-to-end speech recognition algorithms have been proposed. Examples include systems based on Connectionist Temporal Classification (CTC), the Recurrent Neural Network Transducer (RNN-T), Attention-based Encoder-Decoder models (AED), Monotonic Chunkwise Attention (MoChA), and Transformer-based architectures. These fully neural systems require much smaller memory footprints than conventional algorithms, and their on-device implementation has therefore become feasible. In this paper, we review such end-to-end speech recognition models and extensively discuss their structures, performance, and advantages over conventional algorithms.