Practical dialog systems need to deal with various knowledge sources, noisy user expressions, and a shortage of annotated data. To better address these problems, we propose CGoDial, a new, challenging, and comprehensive Chinese benchmark for multi-domain Goal-oriented Dialog evaluation. It contains 96,763 dialog sessions and 574,949 dialog turns in total, covering three datasets with different knowledge sources: 1) a slot-based dialog (SBD) dataset with table-formed knowledge, 2) a flow-based dialog (FBD) dataset with tree-formed knowledge, and 3) a retrieval-based dialog (RBD) dataset with candidate-formed knowledge. To bridge the gap between academic benchmarks and spoken dialog scenarios, we either collect data from real conversations or add spoken features to existing datasets via crowd-sourcing. The proposed experimental settings combine training on either the full training set or a few-shot subset with testing on either the standard test set or a hard test subset, which together assess model capabilities in terms of general prediction, fast adaptability, and reliable robustness.
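To make the 2×2 evaluation grid concrete, below is a minimal sketch enumerating the four train/test settings per dataset. All identifiers (dataset and split names) are hypothetical placeholders for illustration; the benchmark's actual data organization may differ.

```python
from itertools import product

# Hypothetical names for illustration; CGoDial's actual splits may differ.
DATASETS = ["SBD", "FBD", "RBD"]       # slot-, flow-, retrieval-based dialog
TRAIN_SETS = ["full", "few_shot"]      # entire training set vs. few-shot subset
TEST_SETS = ["standard", "hard"]       # standard test set vs. hard test subset

# Each dataset is evaluated under all four train/test combinations:
# full/standard probes general prediction, few_shot/* probes fast
# adaptability, and */hard probes robustness.
for dataset, train, test in product(DATASETS, TRAIN_SETS, TEST_SETS):
    print(f"{dataset}: train on {train}, evaluate on {test}")
```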