We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic speech recognition (E2E-ASR) model formulated as a transducer with a BERT-enhanced encoder. Integrating a large-scale pre-trained language model (LM) into E2E-ASR has been actively studied, with the aim of exploiting versatile linguistic knowledge to generate accurate text. One crucial factor that makes this integration challenging is vocabulary mismatch: the vocabulary constructed for a pre-trained LM is generally too large for E2E-ASR training and is likely to be mismatched with the target ASR domain. To overcome this issue, we propose BECTRA, an extension of our previous BERT-CTC, that realizes BERT-based E2E-ASR using a vocabulary of interest. BECTRA is a transducer-based model that adopts BERT-CTC for its encoder and trains an ASR-specific decoder using a vocabulary suited to the target task. Building on the combination of the transducer and BERT-CTC, we also propose a novel inference algorithm that takes advantage of both autoregressive and non-autoregressive decoding. Experimental results on several ASR tasks, varying in amount of data, speaking style, and language, demonstrate that BECTRA outperforms BERT-CTC by effectively dealing with the vocabulary mismatch while exploiting BERT knowledge.
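As a rough illustration of why a transducer decoder can use its own vocabulary, the sketch below implements a generic additive transducer joint network in plain Python. It is not the authors' implementation: the random vectors merely stand in for BERT-CTC encoder outputs and prediction-network states, and `ASR_VOCAB_SIZE`, `HIDDEN_DIM`, `T`, and `U` are hypothetical values chosen for the example. The point it shows is that the output dimension of the joint lattice is set only by the ASR vocabulary (plus a blank symbol), regardless of the vocabulary the encoder side was built with.

```python
import math
import random


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def joint_lattice(enc_frames, dec_states, weights):
    """Additive transducer joint: one log-distribution per (frame, label-prefix) pair."""
    lattice = []
    for h in enc_frames:                      # T acoustic frames
        row = []
        for g in dec_states:                  # U + 1 label prefixes
            hidden = [math.tanh(a + b) for a, b in zip(h, g)]
            scores = [sum(w * x for w, x in zip(w_row, hidden)) for w_row in weights]
            row.append(log_softmax(scores))
        lattice.append(row)
    return lattice


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
ASR_VOCAB_SIZE = 50   # hypothetical task-specific vocabulary size
HIDDEN_DIM = 8        # hypothetical shared hidden size
T, U = 6, 3           # hypothetical number of frames and target tokens

# Stand-ins (assumptions): in a real system these would come from the
# BERT-CTC encoder and from the ASR-specific prediction network.
enc_frames = random_matrix(T, HIDDEN_DIM, rng)
dec_states = random_matrix(U + 1, HIDDEN_DIM, rng)
weights = random_matrix(ASR_VOCAB_SIZE + 1, HIDDEN_DIM, rng)  # +1 for blank

lattice = joint_lattice(enc_frames, dec_states, weights)
print(len(lattice), len(lattice[0]), len(lattice[0][0]))  # prints: 6 4 51
```

The lattice has shape T x (U + 1) x (ASR vocabulary + blank), so changing the ASR vocabulary only changes the projection `weights`, not the encoder side.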