Error correction techniques remain effective for refining the outputs of automatic speech recognition (ASR) models. Existing end-to-end error correction methods based on an encoder-decoder architecture process all tokens in the decoding phase, which creates undesirable latency. In this paper, we propose an ASR error correction method that utilizes predictions of correction operations. More specifically, we construct a predictor between the encoder and the decoder that learns whether each token should be kept ("K"), deleted ("D"), or changed ("C"), so that decoding is restricted to only part of the input sequence embeddings (the "C" tokens) for fast inference. Experiments on three public datasets demonstrate the effectiveness of the proposed approach in reducing the latency of the decoding process in ASR correction. Compared to a strong encoder-decoder baseline, our two proposed models increase inference speed by at least three times (3.4 and 5.7 times, respectively) while maintaining the same level of accuracy (with WER reductions of 0.53% and 1.69%, respectively). In addition, we produce and release a benchmark dataset for the ASR error correction community to foster research along this line.
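To make the keep/delete/change idea concrete, the following is a minimal sketch of how predicted operations could restrict decoding to the "C" tokens. It is an illustration under stated assumptions, not the paper's implementation: the function names `apply_operations` and `toy_decoder`, the lookup table `TOY_FIXES`, and the example sentence are all hypothetical, the operation labels are given by hand rather than produced by a trained predictor, and the sketch assumes each "C" token is decoded independently into zero or more output tokens.

```python
from typing import Callable, Dict, List, Sequence


def apply_operations(
    tokens: Sequence[str],
    ops: Sequence[str],
    decode_changed: Callable[[int, str], List[str]],
) -> List[str]:
    """Rebuild a corrected sentence from per-token K/D/C labels.

    Only tokens labelled "C" are passed to the (expensive) decoder;
    "K" tokens are copied through and "D" tokens are dropped.
    """
    if len(tokens) != len(ops):
        raise ValueError("tokens and ops must have the same length")

    corrected: List[str] = []
    for position, (token, op) in enumerate(zip(tokens, ops)):
        if op == "K":
            corrected.append(token)  # keep: no decoding needed
        elif op == "D":
            continue  # delete: no decoding needed
        elif op == "C":
            # change: the only case that invokes the decoder
            corrected.extend(decode_changed(position, token))
        else:
            raise ValueError(f"unknown operation {op!r}")
    return corrected


# Hypothetical stand-in for a neural decoder: a fixed lookup table.
TOY_FIXES: Dict[str, List[str]] = {"whether": ["weather"]}


def toy_decoder(position: int, token: str) -> List[str]:
    """Return a replacement for a "C" token (assumption: table lookup)."""
    return TOY_FIXES.get(token, [token])


# Hand-written ASR hypothesis and operation labels (assumed, not predicted).
hypothesis = ["the", "the", "whether", "is", "nice"]
operations = ["K", "D", "C", "K", "K"]

corrected = apply_operations(hypothesis, operations, toy_decoder)
decoder_calls = operations.count("C")

print(" ".join(corrected))
print(f"decoder invoked for {decoder_calls} of {len(hypothesis)} tokens")
```

In this toy case the decoder runs for one token out of five, which is the source of the latency saving the abstract describes: the fewer tokens labelled "C", the less work the decoder has to do.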