Chatbots are designed to carry out human-like conversations across different domains, such as general chit-chat, knowledge exchange, and persona-grounded conversations. To measure the quality of such conversational agents, a dialogue evaluator is expected to conduct assessments across domains as well. However, most state-of-the-art automatic dialogue evaluation metrics (ADMs) are not designed for multi-domain evaluation. We are motivated to design a general and robust framework, MDD-Eval, to address this problem. Specifically, we first train a teacher evaluator with human-annotated data to acquire a rating skill, i.e., to tell good dialogue responses from bad ones in a particular domain, and then adopt a self-training strategy to train a new evaluator with teacher-annotated multi-domain data, which helps the new evaluator generalize across multiple domains. MDD-Eval is extensively assessed on six dialogue evaluation benchmarks. Empirical results show that the MDD-Eval framework achieves strong performance, with an absolute improvement of 7% over state-of-the-art ADMs in terms of mean Spearman correlation scores across all the evaluation benchmarks.
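The following is a minimal, self-contained sketch of the teacher-student self-training idea summarized above; it is not the authors' implementation, and the evaluator class, length-based features, and toy dialogues are hypothetical placeholders used only to illustrate the three steps (train teacher, pseudo-label multi-domain data, train student).

```python
# Conceptual sketch (assumed, not the MDD-Eval code) of teacher-student
# self-training for dialogue evaluation. All models and data are toy stand-ins.

class ToyEvaluator:
    """Scores a (context, response) pair; a trivial response-length heuristic
    stands in for a learned rating model here."""
    def __init__(self):
        self.threshold = 0.0

    def fit(self, examples):
        # examples: list of (context, response, quality_label) triples.
        # "Training" just records the average length of good responses.
        good = [len(r.split()) for _, r, label in examples if label == 1]
        self.threshold = sum(good) / max(len(good), 1)
        return self

    def score(self, context, response):
        # Rate a response as good (1) or bad (0) relative to the learned length.
        return 1 if len(response.split()) >= self.threshold / 2 else 0


# 1. Teacher acquires a rating skill from human-annotated, single-domain data.
human_annotated = [
    ("how are you?", "i am doing great, thanks for asking.", 1),
    ("how are you?", "blue.", 0),
]
teacher = ToyEvaluator().fit(human_annotated)

# 2. Teacher pseudo-labels unlabeled dialogues drawn from multiple domains
#    (e.g., knowledge exchange, persona-grounded chit-chat).
multi_domain = [
    ("what is the capital of france?", "paris is the capital of france."),
    ("tell me about yourself.", "i enjoy hiking and reading."),
]
teacher_annotated = [(c, r, teacher.score(c, r)) for c, r in multi_domain]

# 3. The new (student) evaluator is trained on the teacher-annotated
#    multi-domain data so that it generalizes across domains.
student = ToyEvaluator().fit(teacher_annotated)
```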