Conversational search systems, such as Google Assistant and Microsoft Cortana, provide a new search paradigm in which users communicate with search systems through natural language dialogue. Evaluating such systems is very challenging because search results are presented as natural language sentences. Given the unlimited number of possible responses, collecting relevance assessments for all of them is infeasible. In this paper, we propose POSSCORE, a simple yet effective automatic evaluation method for conversational search. The proposed embedding-based metric takes into account the part of speech (POS) of the terms in the response. To the best of our knowledge, our work is the first to systematically demonstrate the importance of incorporating syntactic information, such as POS labels, for conversational search evaluation. Experimental results demonstrate that our metrics can correlate with human preference, achieving significant improvements over state-of-the-art baseline metrics.
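The abstract does not give the POSSCORE formula, so the following is only a minimal sketch of the general idea it describes: an embedding-based similarity in which each term's contribution depends on its POS tag. Everything here is an assumption made for illustration, including the function name `pos_weighted_score`, the greedy best-match scheme, the toy 3-dimensional embeddings, and the POS weights. It should not be read as the authors' method.

```python
import math

# Hypothetical toy embeddings (orthogonal 3-d vectors), for illustration only.
TOY_EMB = {
    "the": (1.0, 0.0, 0.0),
    "store": (0.0, 1.0, 0.0),
    "opens": (0.0, 0.0, 1.0),
}

# Hypothetical POS weights: content words count more than function words.
TOY_POS_WEIGHT = {"NOUN": 1.0, "VERB": 1.0, "DET": 0.2}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pos_weighted_score(candidate, reference, emb, pos_weight, default_weight=0.5):
    """Score a candidate response against a reference response.

    Both arguments are lists of (token, pos_tag) pairs. Each reference token
    is greedily matched to its most similar candidate token, and the match
    similarities are averaged with weights chosen by the reference token's POS.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for ref_tok, ref_tag in reference:
        weight = pos_weight.get(ref_tag, default_weight)
        best = 0.0
        if ref_tok in emb:
            # Best cosine match among candidate tokens that have an embedding.
            best = max(
                (cosine(emb[ref_tok], emb[c]) for c, _ in candidate if c in emb),
                default=0.0,
            )
        weighted_sum += weight * best
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0


reference = [("the", "DET"), ("store", "NOUN"), ("opens", "VERB")]
missing_det = [("store", "NOUN"), ("opens", "VERB")]
missing_noun = [("the", "DET"), ("opens", "VERB")]

score_missing_det = pos_weighted_score(missing_det, reference, TOY_EMB, TOY_POS_WEIGHT)
score_missing_noun = pos_weighted_score(missing_noun, reference, TOY_EMB, TOY_POS_WEIGHT)
```

Under these assumed weights, a response that drops the determiner is penalized less than one that drops the noun, which is the kind of behavior a POS-aware metric is meant to capture. In practice the POS tags would come from a tagger and the vectors from a pretrained embedding model.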