Meta-evaluation of Conversational Search Evaluation Metrics

Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems in multiple rounds through natural language dialogues. Evaluating such systems is very challenging because any natural language response could be generated, and users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies have proposed many evaluation metrics, the extent to which those metrics capture user preference remains to be investigated. In this paper, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect "actual" performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture properties deemed important in the context of conversational search, namely adequacy, informativeness, and fluency. By conducting experiments on two test collections, we find that the performance of different metrics varies significantly across scenarios. Consistent with prior studies, existing metrics achieve only a weak correlation with ultimate user preference and satisfaction. METEOR is, comparatively speaking, the best existing single-turn metric considering all three perspectives. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation for conversational search to date.
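To make the reliability and fidelity perspectives more concrete, the following is a minimal Python sketch under stated assumptions, not the procedure used in the paper. It assumes reliability can be approximated with a paired sign-flip randomization test on per-item metric scores, and fidelity with the fraction of response pairs on which a metric agrees with the user's stated preference. The function names, parameters, and toy inputs are hypothetical and chosen for illustration only.

```python
import random


def preference_agreement(metric_a, metric_b, user_prefers_a):
    """Illustrative fidelity check (assumed definition, not the paper's):
    fraction of response pairs where the metric orders the two responses
    the same way the user did. Pairs the metric scores equally are skipped."""
    agree = 0
    total = 0
    for a, b, pref in zip(metric_a, metric_b, user_prefers_a):
        if a == b:
            continue  # the metric expresses no preference for this pair
        total += 1
        if (a > b) == pref:
            agree += 1
    return agree / total if total else 0.0


def paired_randomization_p_value(scores_a, scores_b, trials=2000, seed=0):
    """Illustrative reliability check (assumed definition, not the paper's):
    two-sided sign-flip randomization test on per-item score differences.
    A small p-value suggests the observed mean difference between two
    systems is unlikely to have arisen by chance."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(trials):
        # Randomly swap which system each per-item score is attributed to.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Toy data: metric scores for two candidate responses per query,
    # plus whether the user preferred response A.
    fidelity = preference_agreement(
        [0.9, 0.2, 0.5], [0.1, 0.8, 0.5], [True, True, False]
    )
    print("agreement with user preference:", fidelity)

    p = paired_randomization_p_value([1.0] * 20, [0.0] * 20)
    print("randomization test p-value:", p)
```

In this sketch, a metric that more often yields small p-values for system pairs that truly differ would count as more reliable, and a metric with higher preference agreement would count as having higher fidelity. Intuitiveness is left out because it depends on human annotations of adequacy, informativeness, and fluency that cannot be reconstructed from the abstract.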
