Most human interactions take the form of spoken conversations, in which the semantic meaning of a given utterance depends on the context. Each utterance in a spoken conversation can be represented by many semantic and speaker attributes, and there has been interest in building Spoken Language Understanding (SLU) systems that automatically predict these attributes. Recent work has shown that incorporating dialog history can improve SLU performance. However, prior approaches use a separate model for each SLU task, which increases inference time and computation cost. Motivated by this, we ask: can we jointly model all the SLU tasks while incorporating context to facilitate low-latency and lightweight inference? To answer this, we propose a novel model architecture that learns dialog context to jointly predict the intent, dialog act, speaker role, and emotion of a spoken utterance. Because our joint prediction relies on an autoregressive model, the prediction order of the dialog attributes must be chosen, which is not trivial. To mitigate this issue, we also propose an order-agnostic training method. Our experiments show that our joint model achieves results comparable to task-specific classifiers and can effectively integrate dialog context to further improve SLU performance.
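As a rough illustration of the order-agnostic training idea, one way to avoid committing to a fixed prediction order is to randomly permute the attribute order in the autoregressive target for each training example. The sketch below is a minimal assumption of how such targets might be serialized; the attribute names, tag format, and the `make_target` helper are illustrative and not the paper's actual implementation.

```python
import random

# Hypothetical sketch of order-agnostic target construction: the four dialog
# attributes are serialized in a random order per example, so the
# autoregressive decoder is not tied to any single prediction order.
ATTRIBUTES = ["intent", "dialog_act", "speaker_role", "emotion"]

def make_target(labels: dict) -> str:
    """Serialize attribute labels into a target string in a random order."""
    order = random.sample(ATTRIBUTES, k=len(ATTRIBUTES))
    return " ".join(f"<{name}> {labels[name]}" for name in order)

# Example usage with dummy labels for one utterance:
labels = {
    "intent": "book_flight",
    "dialog_act": "inform",
    "speaker_role": "customer",
    "emotion": "neutral",
}
print(make_target(labels))
# e.g. "<emotion> neutral <intent> book_flight <speaker_role> customer <dialog_act> inform"
```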