Spoken language understanding (SLU) requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions. To boost the models' performance, various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. However, the inherent disparities between the two modalities necessitate a mutual analysis. In this paper, we propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules. Besides conducting a self-supervised masked language modeling task on the two individual modules using unpaired speech and text, SPLAT aligns representations from the two modules in a shared latent space using a small amount of paired speech and text. Thus, during fine-tuning, the speech module alone can produce representations carrying both acoustic information and contextual semantic knowledge of an input acoustic signal. Experimental results verify the effectiveness of our approach on various SLU tasks. For example, SPLAT improves the previous state-of-the-art performance on the Spoken SQuAD dataset by more than 10%.
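For intuition, the sketch below illustrates the alignment idea described above: paired speech and text encoder outputs are projected into a shared latent space and pulled together, on top of the masked language modeling losses computed on unpaired data. This is a minimal, assumed formulation for illustration only; the class name, the mean-pooling choice, the L2 objective, and all dimensions are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignmentLoss(nn.Module):
    """Illustrative sketch: pull paired speech and text representations together.

    Each encoder's output sequence is mean-pooled, projected into a shared space,
    and penalized by the L2 distance between the pooled vectors. This is an
    assumed simplification, not the exact objective used by SPLAT.
    """

    def __init__(self, speech_dim: int = 768, text_dim: int = 768, shared_dim: int = 768):
        super().__init__()
        # Project both modalities into a common latent space (dimensions are assumptions).
        self.speech_proj = nn.Linear(speech_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, speech_repr: torch.Tensor, text_repr: torch.Tensor) -> torch.Tensor:
        # speech_repr: (batch, speech_len, speech_dim); text_repr: (batch, text_len, text_dim)
        s = self.speech_proj(speech_repr).mean(dim=1)  # (batch, shared_dim)
        t = self.text_proj(text_repr).mean(dim=1)      # (batch, shared_dim)
        return F.mse_loss(s, t)


# Usage sketch: the alignment term is added to the self-supervised masked-LM losses,
# which are computed on unpaired speech and text (loss names here are hypothetical).
# total_loss = speech_mlm_loss + text_mlm_loss + align_weight * AlignmentLoss()(s_out, t_out)
```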