Spoken language understanding (SLU) systems extract both text transcripts and semantics associated with intents and slots from input speech utterances. SLU systems usually consist of (1) an automatic speech recognition (ASR) module, (2) an interface module that exposes relevant outputs from ASR, and (3) a natural language understanding (NLU) module. Interfaces in SLU systems carry information such as text transcriptions or richer representations like neural embeddings from ASR to NLU. In this paper, we study how interfaces affect joint training for spoken language understanding. Most notably, we obtain state-of-the-art results on the publicly available 50-hour SLURP dataset. We first leverage large pretrained ASR and NLU models connected by a text interface, and then jointly train both models via a sequence loss function. For scenarios where pretrained models are not utilized, the best results are obtained through joint sequence loss training using richer neural interfaces. Finally, we show that the benefit of leveraging pretrained models diminishes as the training data size increases.