We propose cross-modal transformer-based neural correction models that refine the output of an automatic speech recognition (ASR) system to remove ASR errors. Neural correction models are generally composed of encoder-decoder networks, which can directly model sequence-to-sequence mapping problems. The most successful approach uses both the input speech and its ASR output text as input contexts for the encoder-decoder networks. However, this conventional approach cannot take the relationships between the two inputs into account because the input contexts are encoded separately for each modality. To effectively leverage the correlated information between the two modalities, our proposed models jointly encode the two contexts on the basis of cross-modal self-attention using a transformer. We expect cross-modal self-attention to effectively capture the relationships between the two modalities when refining ASR hypotheses. We also introduce a shallow fusion technique to efficiently integrate the first-pass ASR model with our proposed neural correction model. Experiments on Japanese natural-language ASR tasks demonstrate that our proposed models achieve better ASR performance than conventional neural correction models.
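To make the joint-encoding idea concrete, here is a minimal sketch in a PyTorch style. This is not the authors' implementation: the module names, the 80-dim speech features, and all layer sizes are illustrative assumptions, and positional encodings are omitted for brevity. The key point it illustrates is that speech features and ASR-hypothesis token embeddings are concatenated into one sequence, so a single transformer encoder's self-attention spans both modalities.

```python
# Sketch of cross-modal self-attention via joint encoding of two modalities.
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=4, num_layers=4, vocab_size=8000):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)  # ASR hypothesis tokens
        self.speech_proj = nn.Linear(80, d_model)           # e.g. 80-dim log-mel frames
        # Learned embeddings marking which modality each position came from
        self.modality_emb = nn.Embedding(2, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, speech_feats, hyp_tokens):
        # speech_feats: (B, T_s, 80); hyp_tokens: (B, T_t) int64
        s = self.speech_proj(speech_feats) + self.modality_emb.weight[0]
        t = self.token_emb(hyp_tokens) + self.modality_emb.weight[1]
        joint = torch.cat([s, t], dim=1)  # one sequence covering both modalities
        # Self-attention inside the shared encoder now relates speech positions
        # to hypothesis tokens directly, i.e. cross-modal self-attention.
        return self.encoder(joint)        # (B, T_s + T_t, d_model)

# Usage: a decoder would attend to this joint memory when emitting the
# corrected transcript.
enc = CrossModalEncoder()
memory = enc(torch.randn(2, 120, 80), torch.randint(0, 8000, (2, 20)))
```

For the shallow fusion step, a standard formulation (assumed here, not quoted from the paper) interpolates log-probabilities during beam search, e.g. score(y) = log P_corr(y | speech, hypothesis) + λ · log P_ASR(y | speech), so the first-pass ASR model and the correction model jointly rank hypotheses.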