The construction of open-domain dialogue systems requires high-quality dialogue datasets. Dialogue data admits a wide variety of responses to a given dialogue history, in particular responses with different semantics. However, collecting such a high-quality dataset is labor-intensive and time-consuming in most scenarios. In this paper, we propose a data augmentation method that automatically augments high-quality responses with different semantics by counterfactual inference. Specifically, given an observed dialogue, our counterfactual generation model first infers semantically different responses by replacing the observed reply perspective with substituted ones. Our data selection method then filters out detrimental augmented responses. Experimental results show that our data augmentation method can augment high-quality responses with different semantics for a given dialogue history, and that it outperforms competitive baselines on multiple downstream tasks.
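The two stages described above (counterfactual generation followed by data selection) can be illustrated with a minimal sketch. This is not the paper's implementation: the `Dialogue` type, the `augment` function, the `generate` and `score` callables, and the threshold are all hypothetical names introduced here, and the toy stand-ins below only mimic a generation model and a quality scorer.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Dialogue:
    """Hypothetical container: a dialogue history, one response, and its reply perspective."""
    history: Tuple[str, ...]
    response: str
    perspective: str


def augment(
    observed: Dialogue,
    candidate_perspectives: Sequence[str],
    generate: Callable[[Tuple[str, ...], str], str],
    score: Callable[[Tuple[str, ...], str], float],
    threshold: float,
) -> List[Dialogue]:
    """Generate one response per substituted perspective, then keep those scoring >= threshold."""
    kept = []
    for perspective in candidate_perspectives:
        if perspective == observed.perspective:
            continue  # only substituted (counterfactual) perspectives are of interest
        response = generate(observed.history, perspective)
        # Selection step: drop copies of the observed reply and low-scoring candidates.
        if response != observed.response and score(observed.history, response) >= threshold:
            kept.append(Dialogue(observed.history, response, perspective))
    return kept


# Toy stand-ins (assumptions, not real models).
def toy_generate(history: Tuple[str, ...], perspective: str) -> str:
    return f"({perspective}) reply to: {history[-1]}"


def toy_score(history: Tuple[str, ...], response: str) -> float:
    # Penalize one perspective so that the filter has something to remove.
    return 0.2 if response.startswith("(price)") else 0.9


observed = Dialogue(
    history=("Have you tried the new cafe?",),
    response="Yes, the coffee is great.",
    perspective="taste",
)
augmented = augment(
    observed, ["taste", "price", "location"], toy_generate, toy_score, threshold=0.5
)
for d in augmented:
    print(d.perspective, "->", d.response)
```

In this toy run the observed perspective is skipped, the low-scoring candidate is filtered out, and one augmented response remains. In practice, `generate` would be a trained counterfactual generation model and `score` a learned selection criterion; how the paper defines either is not specified in this abstract.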