Language Models (LMs) have been ubiquitously leveraged in various tasks including spoken language understanding (SLU). Spoken language requires careful understanding of speaker interactions, dialog states, and speech-induced multimodal behaviors to generate a meaningful representation of the conversation. In this work, we propose to dissect SLU into three representative properties: conversational (disfluency, pause, overtalk), channel (speaker-type, turn-tasks), and ASR (insertion, deletion, substitution). We probe BERT-based language models (BERT, RoBERTa) trained on spoken transcripts to investigate their ability to understand these multifarious properties in the absence of any speech cues. Empirical results indicate that the LM is surprisingly good at capturing conversational properties, such as pause prediction and overtalk detection, from lexical tokens alone. On the downside, the LM scores low on turn-tasks and ASR error prediction. Additionally, pre-training the LM on spoken transcripts restrains its linguistic understanding. Finally, we establish the efficacy and transferability of the studied properties on two benchmark datasets: the Switchboard Dialog Act and Disfluency datasets.
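To make the probing setup concrete, the sketch below illustrates one common way such probes are built: a frozen BERT encoder whose token representations feed a small trainable linear classifier, here framed as token-level pause prediction. This is a minimal sketch assuming the HuggingFace `transformers` API; the model checkpoint, toy transcript, and labels are illustrative, not the paper's exact protocol.

```python
# Minimal probing sketch: freeze the LM, train only a linear probe on top.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # the LM itself is not updated

# Linear probe: maps each token representation to a binary label
# (e.g., "a pause follows this token" vs. "no pause").
probe = torch.nn.Linear(encoder.config.hidden_size, 2)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Hypothetical training example: a spoken-style transcript with toy labels.
text = "so um i was thinking we could uh meet tomorrow"
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
seq_len = inputs["input_ids"].shape[1]
labels = torch.zeros(1, seq_len, dtype=torch.long)
labels[0, 1] = 1  # toy label: mark one position as preceding a pause

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden)

logits = probe(hidden)  # (1, seq_len, 2); gradients flow only to the probe
loss = torch.nn.functional.cross_entropy(logits.view(-1, 2), labels.view(-1))
loss.backward()
optimizer.step()
```

Because the encoder is frozen, probe accuracy reflects how much of the target property is already encoded in the LM's representations, which is the premise behind the conversational, channel, and ASR probes studied here.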