We propose an end-to-end trained spoken language understanding (SLU) system that extracts transcripts, intents, and slots from an input speech utterance. It consists of a streaming recurrent neural network transducer (RNNT)-based automatic speech recognition (ASR) model connected to a neural natural language understanding (NLU) model through a neural interface. This interface allows for end-to-end training using multi-task RNNT and NLU losses. Additionally, we introduce semantic sequence loss training for the joint RNNT-NLU system, which allows direct optimization of non-differentiable SLU metrics. This end-to-end SLU paradigm can leverage state-of-the-art advances and pretrained models from both the ASR and NLU research communities, and it outperforms both recently proposed direct speech-to-semantics models and conventional pipelined ASR and NLU systems. We show that this approach improves both ASR and NLU metrics on public SLU datasets as well as on large proprietary datasets.
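As a concrete illustration, the following is a minimal PyTorch sketch (not the authors' implementation) of how such a joint RNNT-NLU model and its multi-task objective could fit together. The input feature dimension, LSTM modules, head sizes, and the loss weight `lam` are illustrative assumptions; the neural interface here simply feeds the RNNT predictor states into the NLU module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio

class JointRnntNlu(nn.Module):
    """Sketch: an RNNT ASR model whose predictor states feed an NLU module."""
    def __init__(self, n_feats=80, vocab=100, n_intents=10, n_slots=20, d=256, blank=0):
        super().__init__()
        self.vocab, self.blank = vocab, blank
        self.encoder = nn.LSTM(n_feats, d, batch_first=True)  # streaming acoustic encoder
        self.predictor = nn.LSTM(vocab, d, batch_first=True)  # autoregressive label encoder
        self.joiner = nn.Linear(d, vocab)                     # RNNT joint network
        self.nlu = nn.LSTM(d, d, batch_first=True)            # NLU fed through a neural interface
        self.intent_head = nn.Linear(d, n_intents)
        self.slot_head = nn.Linear(d, n_slots)

    def forward(self, feats, tokens):
        enc, _ = self.encoder(feats)                           # (B, T, d)
        blanks = torch.full((tokens.size(0), 1), self.blank,
                            dtype=torch.long, device=tokens.device)
        pred_in = F.one_hot(torch.cat([blanks, tokens], 1), self.vocab).float()
        pred, _ = self.predictor(pred_in)                      # (B, U+1, d)
        # Additive joiner over all (t, u) pairs -> (B, T, U+1, vocab) logits.
        rnnt_logits = self.joiner(torch.tanh(enc.unsqueeze(2) + pred.unsqueeze(1)))
        nlu_out, _ = self.nlu(pred[:, 1:])                     # NLU reads ASR hidden states
        return rnnt_logits, self.intent_head(nlu_out[:, -1]), self.slot_head(nlu_out)

def multitask_loss(model, feats, feat_lens, tokens, token_lens, intent, slots, lam=0.5):
    """Joint objective: RNNT transcription loss plus weighted NLU losses."""
    rnnt_logits, intent_logits, slot_logits = model(feats, tokens)
    l_asr = torchaudio.functional.rnnt_loss(
        rnnt_logits, tokens.int(), feat_lens.int(), token_lens.int(), blank=model.blank)
    l_nlu = F.cross_entropy(intent_logits, intent) + \
            F.cross_entropy(slot_logits.transpose(1, 2), slots)
    return l_asr + lam * l_nlu
```

Because gradients from the intent and slot heads flow back through the shared predictor states into the ASR model, both components are optimized jointly rather than pipelined.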
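The abstract does not spell out the form of the semantic sequence loss. One standard way to optimize a non-differentiable sequence metric is an expected-cost (minimum-risk, MWER-style) objective over an n-best list; the sketch below assumes hypothetical inputs, namely per-hypothesis model log-probabilities and precomputed semantic costs for each hypothesis.

```python
import torch
import torch.nn.functional as F

def expected_cost_loss(nbest_logprobs, nbest_costs):
    """Minimum-risk surrogate for a non-differentiable SLU metric.

    nbest_logprobs: (B, N) model log-probabilities of N hypotheses per utterance.
    nbest_costs:    (B, N) non-differentiable semantic costs per hypothesis,
                    e.g. intent/slot interpretation error, computed offline.
    """
    # Renormalize over the n-best list; gradients flow through these weights.
    probs = F.softmax(nbest_logprobs, dim=-1)
    # Subtracting the mean cost is a common variance-reduction baseline.
    baseline = nbest_costs.mean(dim=-1, keepdim=True)
    return (probs * (nbest_costs - baseline)).sum(dim=-1).mean()
```

Under this kind of surrogate, the model learns to shift probability mass toward hypotheses with lower semantic cost, even though the cost itself has no gradient.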