Automatic evaluation metrics are a crucial component of dialog systems research. Standard language evaluation metrics are known to be ineffective for evaluating dialog. As such, recent research has proposed a number of novel, dialog-specific metrics that correlate better with human judgements. Due to the fast pace of research, many of these metrics have been assessed on different datasets, and there has not yet been a systematic comparison between them. To this end, this paper provides a comprehensive assessment of recently proposed dialog evaluation metrics on a number of datasets: 23 automatic evaluation metrics are evaluated on 10 different datasets. Furthermore, the metrics are assessed in different settings to better characterize their respective strengths and weaknesses. Metrics are assessed (1) on both the turn level and the dialog level, (2) for different dialog lengths, (3) for different dialog qualities (e.g., coherence, engagingness), (4) for different types of response generation models (i.e., generative, retrieval, simple, and state-of-the-art models), (5) taking into account the similarity of different metrics, and (6) exploring combinations of different metrics. This comprehensive assessment offers several takeaways pertaining to dialog evaluation metrics in general. It also suggests how to best assess evaluation metrics and indicates promising directions for future work.
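As an illustration of how such assessments are commonly carried out, the following is a minimal sketch, not the paper's exact protocol: it computes turn-level Pearson and Spearman correlations between automatic metric scores and human judgements, and combines two metrics by averaging their z-normalized scores. All metric names, scores, and ratings in the sketch are hypothetical placeholders.

```python
# A minimal sketch, assuming a typical assessment protocol (the abstract does not
# give the exact procedure): correlate an automatic metric's turn-level scores
# with human quality ratings, and form a simple combination of two metrics.
# All scores, ratings, and metric names below are hypothetical.
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical turn-level scores from two automatic metrics for the same six
# responses, plus the corresponding human quality ratings (e.g., a 1-5 scale).
metric_a = np.array([0.72, 0.31, 0.55, 0.88, 0.40, 0.67])
metric_b = np.array([0.10, 0.05, 0.20, 0.30, 0.08, 0.15])
human    = np.array([4.0, 2.0, 3.0, 5.0, 2.5, 3.5])

def correlate(scores, ratings, name):
    """Report Pearson and Spearman correlations with human judgements."""
    r, r_p = pearsonr(scores, ratings)
    rho, rho_p = spearmanr(scores, ratings)
    print(f"{name}: Pearson r={r:.3f} (p={r_p:.3f}), "
          f"Spearman rho={rho:.3f} (p={rho_p:.3f})")

correlate(metric_a, human, "metric_a")
correlate(metric_b, human, "metric_b")

# One simple way to combine metrics: average their z-normalized scores.
def znorm(x):
    return (x - x.mean()) / x.std()

combined = (znorm(metric_a) + znorm(metric_b)) / 2
correlate(combined, human, "combined")
```

The same procedure can be repeated with dialog-level scores and ratings to assess metrics at the dialog level rather than the turn level.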