Speech-to-text errors made by automatic speech recognition (ASR) systems negatively impact downstream models that rely on ASR transcriptions. Language error correction models have recently been developed as a post-processing text editing approach for refining source sentences. However, efficient models for correcting errors in ASR transcriptions that meet the low-latency requirements of industrial-grade production systems have not been well studied. In this work, we propose a novel non-autoregressive (NAR) error correction approach that improves transcription quality by reducing word error rate (WER) and achieves robust performance across different upstream ASR systems. Our approach augments the text encoding of the Transformer model with a phoneme encoder that embeds pronunciation information. The representations from the phoneme encoder and the text encoder are combined via multi-modal fusion before being fed into the length tagging predictor, which predicts target sequence lengths. The joint encoders also provide inputs to the attention mechanism in the NAR decoder. We experiment on three open-source ASR systems with varying speech-to-text transcription quality, using their erroneous transcriptions on two public English corpora. Results show that our PATCorrect (Phoneme Augmented Transformer for ASR error Correction) consistently outperforms the state-of-the-art NAR error correction method on English corpora across different upstream ASR systems. For example, PATCorrect achieves an 11.62% WER reduction (WERR) averaged over the three ASR systems, compared to the 9.46% WERR achieved by the other method, which uses only the text modality. It also achieves inference latency comparable to other NAR models, on the scale of tens of milliseconds, especially on GPU hardware, while still being 4.2 to 6.7 times faster than autoregressive models on the Common Voice and LibriSpeech datasets.
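To make the two reported metrics concrete, the following is a minimal sketch of how WER and WERR could be computed. It assumes WER is the word-level edit distance divided by the number of reference words, and that WERR is the relative reduction in WER with respect to the uncorrected ASR output; the abstract does not state these formulas, so both are assumptions. The function names and the toy sentences are hypothetical and are not taken from the paper or its datasets.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER: total word edits / total reference words (assumed definition)."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words


def werr(wer_before, wer_after):
    """Relative WER reduction (assumed definition of WERR)."""
    return (wer_before - wer_after) / wer_before


# Toy data, invented for illustration only.
refs = ["the cat sat on the mat", "speech recognition is hard"]
asr = ["the cat sad on a mat", "speech wreck ignition is hard"]
corrected = ["the cat sat on a mat", "speech recognition is hard"]

wer_asr = wer(refs, asr)              # 4 edits over 10 reference words
wer_corrected = wer(refs, corrected)  # 1 edit over 10 reference words
print(f"WER before: {wer_asr:.2f}, after: {wer_corrected:.2f}, "
      f"WERR: {werr(wer_asr, wer_corrected):.2%}")
```

Under this definition, a WERR figure is relative to the baseline WER of each ASR system, so the same WERR can correspond to different absolute WER improvements on different systems.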