Evaluating the quality of a dialogue system is an understudied problem. The recent evolution of evaluation methods motivated this survey, which seeks an explicit and comprehensive analysis of the existing approaches. We are the first to divide the evaluation methods into three classes: automatic evaluation, human-involved evaluation, and user-simulator-based evaluation. For each class, we cover its main features and the related evaluation metrics. We also discuss in detail the existing benchmarks suitable for evaluating dialogue techniques. Finally, we point out some open issues to push evaluation methods toward a new frontier.