The primary purpose of dialogue state tracking (DST), a critical component of an end-to-end conversational system, is to build a model that responds well to real-world situations. Although people often change their minds during ordinary conversations, current benchmark datasets do not adequately reflect such occurrences and instead consist of over-simplified conversations in which no one changes their mind. The main question inspiring the present study is: ``Are current benchmark datasets sufficiently diverse to handle casual conversations in which one changes their mind after a certain topic is over?'' We found that the answer is ``No'': simply injecting template-based turnback utterances significantly degrades DST model performance. The test joint goal accuracy on MultiWOZ decreased by over 5\%p when the simplest form of turnback utterance was injected, and the degradation worsens in more complicated turnback situations. However, we also observed that performance rebounds when turnbacks are appropriately included in the training dataset, implying that the problem lies not with the DST models but with the construction of the benchmark dataset.
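To make the two technical notions in the abstract concrete, the following is a minimal sketch of a template-based turnback injection and of joint goal accuracy, the fraction of turns whose predicted state matches the gold state on every slot. The template wording, the dialogue layout (a list of turns with a cumulative `state` dictionary), and the names `TURNBACK_TEMPLATE`, `inject_turnback`, and `joint_goal_accuracy` are all assumptions made for illustration; they are not the templates or code used in the study.

```python
from copy import deepcopy

# Hypothetical template: the wording actually used in the study is not given here.
TURNBACK_TEMPLATE = (
    "Actually, I changed my mind. Please change the {domain} {slot} to {value}."
)


def inject_turnback(dialogue, slot, new_value):
    """Return a copy of `dialogue` with one extra turnback turn appended.

    Assumed layout: `dialogue` is a list of turns, each a dict with a "user"
    utterance and a "state" dict holding the cumulative gold dialogue state
    keyed by "domain-slot" (for example "hotel-area").
    """
    final_state = dict(dialogue[-1]["state"])
    if slot not in final_state:
        raise ValueError(f"slot {slot!r} was never filled, nothing to turn back")

    domain, slot_name = slot.split("-", 1)
    final_state[slot] = new_value  # the turnback overwrites the earlier value
    turnback_turn = {
        "user": TURNBACK_TEMPLATE.format(
            domain=domain, slot=slot_name, value=new_value
        ),
        "state": final_state,
    }
    # Copy so the original dialogue is left untouched.
    return deepcopy(dialogue) + [turnback_turn]


def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted_states) != len(gold_states):
        raise ValueError("predicted and gold state lists differ in length")
    if not gold_states:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)
```

Under these assumptions, a tracker that keeps the stale value after the injected turn loses exactly that turn in the joint goal accuracy, which is the kind of degradation the abstract describes.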