Dialogue state trackers have made significant progress on benchmark datasets, but their generalization capability to novel and realistic scenarios beyond the held-out conversations is less understood. We propose controllable counterfactuals (CoCo) to bridge this gap and evaluate dialogue state tracking (DST) models on novel scenarios, i.e., would the system successfully handle the request if the user responded differently yet still consistently with the dialogue flow? CoCo leverages turn-level belief states as counterfactual conditionals to produce novel conversation scenarios in two steps: (i) counterfactual goal generation at the turn level by dropping and adding slots followed by replacing slot values, and (ii) counterfactual conversation generation that is conditioned on (i) and consistent with the dialogue flow. Evaluating state-of-the-art DST models on the MultiWOZ dataset with CoCo-generated counterfactuals results in a significant performance drop of up to 30.8% (from 49.4% to 18.6%) in absolute joint goal accuracy. In comparison, widely used techniques such as paraphrasing affect the accuracy by at most 2%. Human evaluations show that CoCo-generated conversations perfectly reflect the underlying user goal with more than 95% accuracy and are as human-like as the original conversations, further strengthening CoCo's reliability and its promise for adoption as part of the robustness evaluation of DST models. Code is available at https://github.com/salesforce/coco-dst.
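To make step (i) concrete, below is a minimal, hypothetical sketch of turn-level counterfactual goal generation (drop slots, add slots, then replace values). The slot names, value pool, and drop/add probabilities are illustrative assumptions, not the authors' exact implementation; step (ii), generating the counterfactual user utterance conditioned on this goal, requires a trained conditional generation model and is omitted here.

```python
# Illustrative sketch only: the policy and slot/value inventory are assumptions.
import random
from typing import Dict, List


def counterfactual_goal(
    turn_belief: Dict[str, str],        # turn-level belief state, e.g. {"restaurant-food": "italian"}
    value_pool: Dict[str, List[str]],   # candidate values per slot, used for adding and replacing
    addable_slots: List[str],           # slots that may be newly introduced into the goal
    drop_prob: float = 0.3,
    add_prob: float = 0.3,
    seed: int = 0,
) -> Dict[str, str]:
    """Step (i): drop and add slots, then replace slot values, to form a novel user goal."""
    rng = random.Random(seed)

    # (a) Drop: randomly remove some slots from the original turn-level goal.
    goal = {s: v for s, v in turn_belief.items() if rng.random() > drop_prob}

    # (b) Add: randomly introduce slots that are not already in the goal.
    for slot in addable_slots:
        if slot not in goal and rng.random() < add_prob:
            goal[slot] = rng.choice(value_pool[slot])

    # (c) Replace: swap each value for a different one drawn from the value pool.
    for slot, value in goal.items():
        alternatives = [v for v in value_pool.get(slot, []) if v != value]
        if alternatives:
            goal[slot] = rng.choice(alternatives)

    return goal


# Toy MultiWOZ-style example.
turn_belief = {"restaurant-food": "italian", "restaurant-area": "centre"}
value_pool = {
    "restaurant-food": ["italian", "chinese", "indian"],
    "restaurant-area": ["centre", "north", "south"],
    "restaurant-pricerange": ["cheap", "moderate", "expensive"],
}
print(counterfactual_goal(turn_belief, value_pool, ["restaurant-pricerange"]))
```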