We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic speech recognition (E2E-ASR) model formulated as a transducer with a BERT-enhanced encoder. Integrating a large-scale pre-trained language model (LM) into E2E-ASR has been actively studied, aiming to utilize versatile linguistic knowledge for generating accurate text. One crucial factor that makes this integration challenging lies in the vocabulary mismatch: the vocabulary constructed for a pre-trained LM is generally too large for E2E-ASR training and is likely to mismatch the target ASR domain. To overcome this issue, we propose BECTRA, an extension of our previous BERT-CTC, which realizes BERT-based E2E-ASR using a vocabulary of interest. BECTRA is a transducer-based model that adopts BERT-CTC for its encoder and trains an ASR-specific decoder using a vocabulary suitable for the target task. By combining the transducer and BERT-CTC, we also propose a novel inference algorithm that takes advantage of both autoregressive and non-autoregressive decoding. Experimental results on several ASR tasks, varying in amounts of data, speaking styles, and languages, demonstrate that BECTRA outperforms BERT-CTC by effectively dealing with the vocabulary mismatch while exploiting BERT knowledge.
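To make the described layout concrete, the following is a minimal sketch (an assumption for illustration, not the authors' implementation) of how a BERT-CTC-style encoder with a CTC head over a large BERT vocabulary can feed a transducer decoder (prediction network plus joint network) trained over a smaller, task-specific ASR vocabulary. All module choices, dimensions, and names here are hypothetical placeholders.

```python
# Hypothetical sketch of the BECTRA layout: BERT-CTC-style encoder branch
# (CTC over a large BERT vocabulary) plus a transducer decoder over an
# ASR-specific vocabulary. Not the authors' code.
import torch
import torch.nn as nn

class BectraSketch(nn.Module):
    def __init__(self, feat_dim=80, enc_dim=256, bert_vocab=30522, asr_vocab=1000):
        super().__init__()
        # Acoustic encoder standing in for the BERT-enhanced encoder.
        self.encoder = nn.LSTM(feat_dim, enc_dim, num_layers=2, batch_first=True)
        # CTC head over the (large) pre-trained LM vocabulary (BERT-CTC branch).
        self.ctc_head = nn.Linear(enc_dim, bert_vocab)
        # Transducer decoder over the smaller, task-specific ASR vocabulary.
        self.prediction = nn.Embedding(asr_vocab, enc_dim)
        self.pred_rnn = nn.LSTM(enc_dim, enc_dim, batch_first=True)
        self.joint = nn.Linear(enc_dim * 2, asr_vocab + 1)  # +1 for the blank label

    def forward(self, feats, prev_tokens):
        enc, _ = self.encoder(feats)                            # (B, T, enc_dim)
        ctc_logits = self.ctc_head(enc)                         # (B, T, bert_vocab)
        pred, _ = self.pred_rnn(self.prediction(prev_tokens))   # (B, U, enc_dim)
        # Joint network: combine every encoder frame with every prediction step.
        joint_in = torch.cat(
            [enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1),
             pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)], dim=-1)
        trans_logits = self.joint(joint_in)                      # (B, T, U, asr_vocab+1)
        return ctc_logits, trans_logits

# Example usage with random features and token history:
# model = BectraSketch()
# ctc_out, rnnt_out = model(torch.randn(2, 50, 80), torch.randint(0, 1000, (2, 10)))
```

The key point the sketch illustrates is that the two output branches use different vocabularies: the CTC branch keeps the pre-trained LM's large vocabulary, while the transducer decoder is free to use whatever unit set suits the target ASR task.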