The advances in attention-based encoder-decoder (AED) networks have brought great progress to end-to-end (E2E) automatic speech recognition (ASR). One way to further improve the performance of AED-based E2E ASR is to introduce an extra text encoder that leverages extensive text data and thus captures more context-aware linguistic information. However, this approach introduces a mismatch between the speech encoder and the text encoder due to the different units used for modeling. In this paper, we propose an embedding aligner and modality switch training to better align the speech and text latent spaces. The embedding aligner is a linear projection shared between the text encoder and the speech encoder, trained with a masked language modeling (MLM) loss and a connectionist temporal classification (CTC) loss, respectively. Modality switch training randomly swaps speech and text embeddings according to forced-alignment results to learn a joint representation space. Experimental results show that our proposed approach achieves a relative 14% to 19% word error rate (WER) reduction on the LibriSpeech ASR task. We further verify its effectiveness on spoken language understanding (SLU), i.e., an absolute 2.5% to 2.8% F1 score improvement on the SNIPS slot-filling task.
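To make the aligner concrete, here is a minimal PyTorch sketch of a shared linear projection with a CTC loss on the speech branch and an MLM loss on the text branch, as the abstract describes. Everything specific in it, including the class name `EmbeddingAligner`, the hidden and vocabulary sizes, the dummy tensors, and the 15% masking rate, is an assumption for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingAligner(nn.Module):
    """Sketch of a linear projection shared by both encoders, so speech and
    text hidden states are scored against the same unit inventory.
    The dimensions here are illustrative, not the paper's."""

    def __init__(self, d_model: int = 256, vocab_size: int = 5000):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, d_model) -> logits: (batch, time, vocab_size)
        return self.proj(hidden)

aligner = EmbeddingAligner()

# Speech branch: CTC loss on the shared projection's outputs.
speech_hidden = torch.randn(2, 100, 256)               # dummy speech encoder output
log_probs = aligner(speech_hidden).log_softmax(-1)
targets = torch.randint(1, 5000, (2, 20))              # dummy token targets (0 = blank)
ctc = F.ctc_loss(log_probs.transpose(0, 1),            # CTC expects (time, batch, vocab)
                 targets,
                 input_lengths=torch.full((2,), 100),
                 target_lengths=torch.full((2,), 20),
                 blank=0, zero_infinity=True)

# Text branch: MLM loss on the same projection's outputs; only ~15% of
# positions are masked and scored (ignore_index marks the rest).
text_hidden = torch.randn(2, 20, 256)                  # dummy text encoder output
mlm_labels = targets.clone()
mlm_labels[torch.rand(mlm_labels.shape) > 0.15] = -100
mlm = F.cross_entropy(aligner(text_hidden).flatten(0, 1),
                      mlm_labels.flatten(),
                      ignore_index=-100)

loss = ctc + mlm                                       # joint objective over the shared aligner
```

Because both branches score their hidden states against the same projection matrix, gradients from both losses push the two encoders toward a common latent space, which is the alignment effect the abstract claims.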
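Modality switch training can be sketched similarly: for each token, a coin flip decides whether its forced-aligned speech frames are replaced with the corresponding text embedding before the mixed sequence is consumed downstream. The function name, the `(start, end)` alignment format, and the 0.3 switch probability below are illustrative assumptions, not values from the paper.

```python
import torch

def modality_switch(speech_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    alignment: list[tuple[int, int]],
                    p_switch: float = 0.3) -> torch.Tensor:
    """Illustrative sketch: randomly replace each token's aligned speech
    frames with that token's text embedding (names and formats assumed)."""
    mixed = speech_emb.clone()
    for tok_idx, (start, end) in enumerate(alignment):
        # alignment[tok_idx] is the frame span a forced aligner assigned to token tok_idx
        if torch.rand(()).item() < p_switch:
            mixed[start:end] = text_emb[tok_idx]   # broadcast the text vector over the span
    return mixed

# Toy usage: 100 speech frames, a 4-token transcript, and assumed spans.
speech_emb = torch.randn(100, 256)                     # frame-level speech embeddings
text_emb = torch.randn(4, 256)                         # token-level text embeddings
alignment = [(0, 25), (25, 50), (50, 80), (80, 100)]   # forced-alignment span per token
mixed = modality_switch(speech_emb, text_emb, alignment)
```

Since either modality can appear at any aligned position, the model is encouraged to treat speech and text embeddings interchangeably, which is one plausible reading of how the swap yields a joint representation space.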