Connectionist temporal classification (CTC)-based models are attractive in automatic speech recognition (ASR) because of their non-autoregressive nature. To take advantage of text-only data, language model (LM) integration approaches such as rescoring and shallow fusion have been widely used for CTC. However, they lose CTC's non-autoregressive nature because of the need for beam search, which slows down inference. In this study, we propose an error correction method with a phone-conditioned masked LM (PC-MLM). In the proposed method, less confident word tokens in a greedily decoded output from CTC are masked. PC-MLM then predicts these masked word tokens given the unmasked words and the phones supplementally predicted from CTC. We further extend it to Deletable PC-MLM in order to address insertion errors. Since both CTC and PC-MLM are non-autoregressive models, the method enables fast LM integration. Experimental evaluations on the Corpus of Spontaneous Japanese (CSJ) and TED-LIUM2 in a domain adaptation setting show that our proposed method outperforms rescoring and shallow fusion in terms of inference speed, and on CSJ also in terms of recognition accuracy.
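The correction procedure described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: `fill_masks` stands in for PC-MLM, and the confidence scores and phone sequence are hypothetical placeholder values rather than real CTC outputs.

```python
# Toy sketch of confidence-based masking + single-pass mask filling.
# `fill_masks` is a hypothetical stand-in for PC-MLM; a real system would
# call a trained masked LM conditioned on phones predicted by CTC.

MASK = "<mask>"

def mask_low_confidence(tokens, confidences, threshold=0.9):
    """Mask word tokens whose CTC confidence falls below the threshold."""
    return [t if c >= threshold else MASK for t, c in zip(tokens, confidences)]

def fill_masks(masked_tokens, phones, predictor):
    """One parallel (non-autoregressive) pass: every masked position is
    predicted from the unmasked context plus the phone sequence."""
    return [predictor(i, masked_tokens, phones) if t == MASK else t
            for i, t in enumerate(masked_tokens)]

if __name__ == "__main__":
    hyp = ["the", "cat", "sat"]                  # greedy CTC word hypothesis
    conf = [0.99, 0.42, 0.97]                    # toy per-token confidences
    phones = ["DH AH", "K AE T", "S AE T"]       # toy CTC phone predictions
    masked = mask_low_confidence(hyp, conf)
    # toy predictor: returns a fixed correction just for this demo
    corrected = fill_masks(masked, phones, lambda i, toks, ph: "cat")
    print(masked)     # ['the', '<mask>', 'sat']
    print(corrected)  # ['the', 'cat', 'sat']
```

Because all masked positions are filled in one parallel pass rather than token by token with beam search, the LM integration step stays non-autoregressive, which is the source of the claimed speedup.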