Recently, self-supervised pretraining has achieved impressive results in end-to-end (E2E) automatic speech recognition (ASR). However, the dominant sequence-to-sequence (S2S) E2E models still struggle to fully utilize self-supervised pretraining methods, because the decoder is conditioned on acoustic representations and thus cannot be pretrained separately. In this paper, we propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models to fully utilize pretrained acoustic models (AMs) and language models (LMs). In our framework, the encoder is initialized with a pretrained AM (wav2vec2.0). The Preformer leverages CTC as an auxiliary task during training and inference. Furthermore, we design a one-cross decoder (OCD), which relaxes the dependence on acoustic representations so that it can be initialized with a pretrained LM (DistilGPT2). Experiments are conducted on the AISHELL-1 corpus, achieving a $4.6\%$ character error rate (CER) on the test set. Compared with our vanilla hybrid CTC/attention Transformer baseline, the proposed CTC/attention-based Preformer yields a $27\%$ relative CER reduction. To the best of our knowledge, this is the first work to utilize both a pretrained AM and LM in an S2S ASR system.
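For reference, the hybrid CTC/attention framework mentioned above is conventionally trained with an interpolated objective; a minimal sketch of this standard formulation is given below, where the interpolation weight $\lambda$ is a hyperparameter not specified in this abstract:

$$\mathcal{L}_{\mathrm{hybrid}} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1-\lambda)\,\mathcal{L}_{\mathrm{att}}, \qquad 0 \le \lambda \le 1,$$

where $\mathcal{L}_{\mathrm{CTC}}$ is the CTC loss computed on the encoder outputs and $\mathcal{L}_{\mathrm{att}}$ is the attention-based cross-entropy loss from the decoder; an analogous interpolation of CTC and attention scores is typically applied during joint decoding at inference time.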