We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU), where a streaming automatic speech recognition (ASR) model produces the first-pass hypothesis and a second-pass natural language understanding (NLU) component generates the semantic parse by conditioning on both ASR's text and audio embeddings. By formulating E2E SLU as a generalized decoder, our system is able to support complex compositional semantic structures. Furthermore, the sharing of parameters between ASR and NLU makes the system especially suitable for resource-constrained (on-device) environments; our proposed approach consistently outperforms strong pipeline NLU baselines by 0.60% to 0.65% on the spoken version of the TOPv2 dataset (STOP). We demonstrate that the fusion of text and audio features, coupled with the system's ability to rewrite the first-pass hypothesis, makes our approach more robust to ASR errors. Finally, we show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training, but more work is required to make text-to-speech (TTS) a viable solution for scaling up E2E SLU.