Masked language models (MLMs) have been widely used for understanding tasks, e.g., BERT. Recently, MLMs have also been used for generation tasks. The most popular example in speech is Mask-CTC for non-autoregressive speech recognition. In this paper, we take one step further and explore the possibility of using an MLM as a non-autoregressive spell correction (SC) model for the transformer-transducer (TT), denoted as MLM-SC. Our initial experiments show that MLM-SC provides no improvement on the Librispeech data. The problem might lie in the choice of modeling units (word pieces) and the inaccuracy of the TT confidence scores for English data. To solve this problem, we propose a mask sample decoding (MS-decode) method, where each masked token has the choice of being masked or not, to compensate for the inaccuracy. As a result, we reduce the WER of a streaming TT from 7.6% to 6.5% on the Librispeech test-other data and the CER from 7.3% to 6.1% on the Aishell test data, respectively.
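To make the idea concrete, the sketch below illustrates one possible reading of confidence-based masking with mask-sample decoding: low-confidence TT tokens are only masked with some probability in each sampled candidate, so an unreliable confidence score does not force the original token out, and the MLM then fills the remaining masks in a single non-autoregressive pass. All names (`msdecode`, `mlm_predict`, the threshold, the sampling rule) are hypothetical and do not reflect the paper's actual implementation.

```python
# Hypothetical sketch of mask-sample decoding (MS-decode); not the paper's code.
import random
from typing import Callable, List

MASK = "<mask>"

def msdecode(tokens: List[str],
             confidences: List[float],
             mlm_predict: Callable[[List[str]], List[str]],
             conf_threshold: float = 0.9,
             num_samples: int = 4,
             seed: int = 0) -> List[List[str]]:
    """Return candidate corrections produced by sampling mask patterns.

    Tokens whose TT confidence falls below `conf_threshold` are candidates
    for masking. Because confidence scores may be inaccurate, each candidate
    token is masked only with probability (1 - confidence) in every sample,
    so the original token can survive; the MLM then fills the remaining
    masks non-autoregressively (one forward pass per sample).
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(num_samples):
        masked = []
        for tok, conf in zip(tokens, confidences):
            if conf < conf_threshold and rng.random() < (1.0 - conf):
                masked.append(MASK)
            else:
                masked.append(tok)
        candidates.append(mlm_predict(masked))
    return candidates

# Toy MLM stand-in: replaces every mask with a fixed token, for illustration only.
def toy_mlm(masked_tokens: List[str]) -> List[str]:
    return [t if t != MASK else "the" for t in masked_tokens]

if __name__ == "__main__":
    hyp = ["i", "red", "the", "book"]   # TT hypothesis with a likely error
    conf = [0.98, 0.42, 0.97, 0.95]     # per-token confidence scores
    for cand in msdecode(hyp, conf, toy_mlm):
        print(" ".join(cand))
```

The sampled candidates would then be scored and the best one selected; the exact selection criterion is left unspecified here, as it is an implementation detail of the method described in the paper.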