Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies have achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LM) pre-trained on large quantities of unpaired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the Spoken Language Understanding Evaluation (SLUE) benchmark and observe that self-supervised pre-trained models are more powerful, with the pre-trained LM and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition tasks, respectively.
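As a minimal illustration of the representation-extraction step described above, the sketch below shows how speech and text features might be obtained from self-supervised pre-trained models before an SLU head is trained on top of them. It is not the authors' code: the HuggingFace checkpoints `facebook/wav2vec2-base` and `bert-base-uncased` are assumed stand-ins for the self-supervised speech model and LM, and the waveform is a dummy input rather than a SLUE utterance.

```python
# Hedged sketch: extract speech and text representations from
# self-supervised pre-trained models (assumed checkpoints, not
# necessarily those used in the paper).
import numpy as np
import torch
from transformers import (
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Model,
    AutoTokenizer,
    AutoModel,
)

# Self-supervised speech model for the audio branch.
speech_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
speech_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# Self-supervised LM for the text branch (e.g. applied to ASR transcripts).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_model = AutoModel.from_pretrained("bert-base-uncased")

# Dummy 1-second, 16 kHz waveform standing in for an SLU utterance.
waveform = np.zeros(16000, dtype=np.float32)
audio_inputs = speech_extractor(waveform, sampling_rate=16000, return_tensors="pt")
text_inputs = tokenizer("hypothetical transcript of the utterance", return_tensors="pt")

with torch.no_grad():
    # Frame-level speech representations: shape (1, T', hidden_size).
    speech_repr = speech_model(**audio_inputs).last_hidden_state
    # Token-level text representations: shape (1, L, hidden_size).
    text_repr = text_model(**text_inputs).last_hidden_state

# A downstream SLU head (e.g. for NER or sentiment) would be trained on
# top of these representations, either frozen or fine-tuned.
print(speech_repr.shape, text_repr.shape)
```

In this setup, the pre-trained encoders supply the "strong speech and text representations" mentioned in the abstract; whether they are frozen or fine-tuned, and how the speech and text branches are combined, are design choices the study compares rather than details fixed by this sketch.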