Citation: Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

Transformers have recently dominated the ASR field. Although they yield good performance, they rely on an autoregressive (AR) decoder that generates tokens one by one, which is computationally inefficient. To speed up inference, non-autoregressive (NAR) methods, e.g. single-step NAR, were designed to enable parallel generation. However, because of an independence assumption among the output tokens, the performance of single-step NAR is inferior to that of AR models, especially with a large-scale corpus. There are two challenges to improving single-step NAR: first, to accurately predict the number of output tokens and extract hidden variables; second, to enhance modeling of the interdependence between output tokens. To tackle both challenges, we propose a fast and accurate parallel transformer, termed Paraformer. It uses a continuous integrate-and-fire based predictor to predict the number of tokens and generate hidden variables. A glancing language model (GLM) sampler then generates semantic embeddings to enhance the NAR decoder's ability to model context interdependence. Finally, we design a strategy to generate negative samples for minimum word error rate training to further improve performance. Experiments on the public AISHELL-1 and AISHELL-2 benchmarks and an industrial-level 20,000-hour task demonstrate that the proposed Paraformer attains performance comparable to the state-of-the-art AR transformer, with more than 10x speedup.
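The continuous integrate-and-fire predictor mentioned in the abstract can be illustrated with a minimal sketch. It shows only the general integrate-and-fire idea: per-frame weights are accumulated, and each time the running sum reaches a threshold, one token-level hidden vector is emitted. The function name `cif_fire`, the threshold of 1.0, the toy inputs, and the omission of any tail handling are assumptions made for illustration; this is not the paper's implementation.

```python
from typing import List


def cif_fire(frames: List[List[float]],
             alphas: List[float],
             threshold: float = 1.0) -> List[List[float]]:
    """Toy integrate-and-fire over encoder frames (illustrative only).

    frames: one feature vector per acoustic frame.
    alphas: one non-negative weight per frame (assumed already predicted).
    Returns one integrated vector per fired token.
    """
    dim = len(frames[0]) if frames else 0
    acc_weight = 0.0            # weight accumulated since the last firing
    acc_vec = [0.0] * dim       # weighted sum of frames since the last firing
    fired: List[List[float]] = []

    for h, a in zip(frames, alphas):
        remaining = a
        # Fire while the accumulated weight would reach the threshold.
        while acc_weight + remaining >= threshold - 1e-9:
            used = threshold - acc_weight   # part of this frame that completes the token
            fired.append([v + used * x for v, x in zip(acc_vec, h)])
            remaining = max(remaining - used, 0.0)
            acc_weight = 0.0
            acc_vec = [0.0] * dim
        # The leftover weight of this frame starts the next token.
        acc_weight += remaining
        acc_vec = [v + remaining * x for v, x in zip(acc_vec, h)]

    return fired


# Four 1-dimensional frames, each with weight 0.5: the weights sum to 2.0,
# so two token vectors are emitted.
tokens = cif_fire([[1.0], [2.0], [3.0], [4.0]], [0.5, 0.5, 0.5, 0.5])
print(len(tokens), tokens)  # 2 [[1.5], [3.5]]
```

In this sketch the number of emitted vectors follows the sum of the weights, which is how a predictor of this kind can supply both a token count and the hidden variables that a parallel decoder consumes in a single step.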
