End-to-end (E2E) spoken language understanding (SLU) can infer semantics directly from the speech signal without cascading an automatic speech recognizer (ASR) with a natural language understanding (NLU) module. However, paired utterance recordings and corresponding semantics may not always be available, or may be insufficient, to train an E2E SLU model in a real production environment. In this paper, we propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder. The unified speech-language pre-trained model (SLP) is continually enhanced on limited labeled data from a target domain using a conditional masked language model (MLM) objective, and can thus effectively generate a sequence of intent, slot type, and slot value for given input speech at inference time. Experimental results on two public corpora show that our approach to E2E SLU is superior to the conventional cascaded method. It also outperforms the present state-of-the-art approaches to E2E SLU with much less paired data.
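The abstract describes generating a sequence of intent, slot type, and slot value under a conditional MLM objective. As a minimal, framework-free sketch of what the target-side data preparation for such an objective might look like (the function names, the flat serialization, and the 30% masking rate are illustrative assumptions, not details from the paper; the actual SLP model conditions the prediction on speech and language encoder outputs):

```python
import random

MASK = "[MASK]"  # placeholder token; the real vocabulary/tokenizer is not specified here

def serialize_semantics(intent, slots):
    # Flatten intent followed by (slot type, slot value) pairs into one
    # token sequence -- the generation target described in the abstract.
    seq = [intent]
    for slot_type, slot_value in slots:
        seq += [slot_type, slot_value]
    return seq

def mask_for_mlm(tokens, mask_prob=0.3, rng=None):
    # Conditional MLM target construction: randomly replace target tokens
    # with [MASK]; the decoder must recover them, conditioned on the
    # (unmasked) speech and language encoder representations.
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)    # loss is computed on masked positions
        else:
            inputs.append(tok)
            labels.append(None)   # no loss on unmasked positions
    return inputs, labels

target = serialize_semantics("SetAlarm", [("time", "7am"), ("day", "monday")])
masked_inputs, labels = mask_for_mlm(target)
```

At inference the abstract's model would start from a fully masked (or partially constrained) target sequence and fill in intent and slot tokens conditioned on the input speech; the sketch above only illustrates the training-side masking.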