Spoken language understanding (SLU) is a task that aims to extract high-level semantics from spoken utterances. Previous works have investigated the use of speech self-supervised models and textual pre-trained models, which have shown reasonable improvements on various SLU tasks. However, because of the mismatched modalities between speech signals and text tokens, previous methods usually require complex framework designs. This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models, resulting in an unsupervised speech-to-semantic pre-trained model for various SLU tasks. Specifically, we propose to use unsupervised automatic speech recognition (ASR) as a connector that bridges the different modalities used in speech and textual pre-trained models. Our experiments show that unsupervised ASR itself can improve the representations from speech self-supervised models. More importantly, it serves as an efficient connector between speech and textual pre-trained models, improving performance on five different SLU tasks. Notably, on spoken question answering, we reach the state-of-the-art result on the challenging NMSQA benchmark.