Recent advances in End-to-End (E2E) Spoken Language Understanding (SLU) have been primarily due to effective pretraining of speech representations. One such pretraining paradigm is the distillation of semantic knowledge from state-of-the-art text-based models like BERT into speech encoder neural networks. This work is a step towards doing the same in a much more efficient and fine-grained manner, where we align speech embeddings and BERT embeddings on a token-by-token basis. We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder, such that these can be directly compared and aligned with BERT-based contextual embeddings. This alignment is performed using a novel tokenwise contrastive loss. Fine-tuning such a pretrained model to perform intent recognition directly from speech yields state-of-the-art performance on two widely used SLU datasets. Our model improves further when fine-tuned with additional regularization using SpecAugment, especially when speech is noisy, giving an absolute improvement as high as 8% over previous results.
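The two ingredients named above — cross-modal attention that produces one speech-side embedding per text token, and a tokenwise contrastive loss that pulls each such embedding toward its BERT counterpart — can be sketched roughly as follows. This is a minimal illustration, not the paper's exact method: the function names, the use of BERT embeddings as attention queries, and the InfoNCE-style formulation of the contrastive loss are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_token_embeddings(speech_frames, bert_tokens):
    """Hypothetical cross-modal attention: BERT token embeddings act as
    queries over the speech-encoder frame sequence, yielding one
    speech-side contextual embedding per text token.
    speech_frames: (T, d) frames; bert_tokens: (N, d) token embeddings."""
    d = speech_frames.shape[-1]
    attn = softmax(bert_tokens @ speech_frames.T / np.sqrt(d))  # (N, T)
    return attn @ speech_frames  # (N, d)

def tokenwise_contrastive_loss(speech_tok, bert_tok, temperature=0.1):
    """InfoNCE-style tokenwise contrastive loss (illustrative assumption):
    each speech-side token embedding is matched to its own BERT embedding,
    with the other tokens of the utterance serving as negatives."""
    s = speech_tok / np.linalg.norm(speech_tok, axis=-1, keepdims=True)
    t = bert_tok / np.linalg.norm(bert_tok, axis=-1, keepdims=True)
    logits = s @ t.T / temperature  # (N, N) cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # cross-entropy on matching pairs
```

In this sketch, minimizing the loss drives the diagonal of the similarity matrix up relative to each row, i.e. aligns each token's speech-derived embedding with its BERT embedding in context.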