Though Dialogue State Tracking (DST) is a core component of spoken dialogue systems, recent work on this task mostly deals with chat corpora, disregarding the discrepancies between spoken and written language. In this paper, we propose OLISIA, a cascade system which integrates an Automatic Speech Recognition (ASR) model and a DST model. We introduce several adaptations in the ASR and DST modules to improve integration and robustness to spoken conversations. With these adaptations, our system ranked first in DSTC11 Track 3, a benchmark to evaluate spoken DST. We conduct an in-depth analysis of the results and find that normalizing the ASR outputs, adapting the DST inputs through data augmentation, and increasing the pre-trained model size all play an important role in reducing the performance discrepancy between written and spoken conversations.