End-to-end (E2E) models are becoming increasingly popular for spoken language understanding (SLU) systems and are beginning to achieve performance competitive with pipeline-based approaches. However, recent work has shown that these models struggle to generalize to new phrasings of the same intent, indicating that they do not capture the semantic content of the given utterance. In this work, we incorporated language models pre-trained on unlabeled text data into E2E-SLU frameworks to build strong semantic representations. Incorporating both semantic and acoustic information can increase inference time, leading to high latency when deployed in applications such as voice assistants. We developed a 2-pass SLU system that makes a low-latency prediction in the first pass using acoustic information from the first few seconds of the audio, and makes a higher-quality prediction in the second pass by combining semantic and acoustic representations. We take inspiration from prior work on 2-pass end-to-end speech recognition systems that attend to both the audio and the first-pass hypothesis using a deliberation network. The proposed 2-pass SLU system outperforms the acoustic-based SLU model on the Fluent Speech Commands Challenge Set and the SLURP dataset and reduces latency, thus improving user experience. Our code and models are publicly available as part of the ESPnet-SLU toolkit.
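The 2-pass control flow described above can be illustrated with a minimal sketch. This is not the ESPnet-SLU implementation: every name here (`two_pass_predict`, `mean_pool`, `linear`, `transcribe`, `embed_text`, the weight matrices) is hypothetical, the classifiers are plain linear scorers, and simple concatenation of acoustic and semantic vectors stands in for the deliberation network, which attends over the audio and the first-pass hypothesis.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def mean_pool(frames: Sequence[Vector]) -> Vector:
    """Average equal-length frame vectors into a single vector."""
    if not frames:
        raise ValueError("need at least one frame")
    return [sum(col) / len(frames) for col in zip(*frames)]


def linear(weights: Sequence[Vector], x: Vector) -> Vector:
    """Return one score per row of `weights` (dot product with `x`)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def two_pass_predict(
    frames: Sequence[Vector],
    first_pass_frames: int,
    w_first: Sequence[Vector],
    w_second: Sequence[Vector],
    transcribe: Callable[[Sequence[Vector]], str],
    embed_text: Callable[[str], Vector],
) -> Tuple[int, int]:
    """Return (first-pass intent, second-pass intent) as class indices."""
    # First pass: acoustic information only, from the first few frames,
    # so a prediction is available before the utterance has ended.
    early_acoustic = mean_pool(frames[:first_pass_frames])
    first = argmax(linear(w_first, early_acoustic))

    # Second pass: full-utterance acoustic vector combined with a semantic
    # vector of the hypothesized transcript. Concatenation is an assumption
    # made for brevity; it replaces the attention-based deliberation step.
    acoustic = mean_pool(frames)
    semantic = embed_text(transcribe(frames))
    second = argmax(linear(w_second, acoustic + semantic))
    return first, second
```

In a deployment along these lines, the first-pass index could be acted on immediately and revised once the second-pass index is available; how the two predictions are reconciled is a design choice the sketch leaves open.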