Reward-free Policy Imitation Learning for Conversational Search

Existing conversational search studies have mainly focused on asking better clarifying questions and/or improving search result quality. These works aim to retrieve better responses given the search context, and their performance is evaluated on either single-turn tasks or multi-turn tasks under naive conversation policy settings. This leaves open questions about their applicability to real-world multi-turn conversations, where every action must be chosen by the system itself and search session efficiency is often an important concern. While some recent works have identified the need to improve search efficiency in conversational search, they mostly require extensive data annotation and use hand-crafted rewards or heuristics to train systems that achieve reasonable performance within a restricted number of turns, which limits their generalizability in practice. In this paper, we propose a reward-free conversation policy imitation learning framework that can train a conversation policy without annotated conversation data or manually designed rewards. The trained conversation policy can be used to guide conversational retrieval models in balancing conversational search quality and efficiency. To evaluate the proposed conversational search system, we propose a new multi-turn-multi-response conversational evaluation metric named Expected Conversational Reciprocal Rank (ECRR). ECRR is designed to evaluate entire multi-turn conversational search sessions, comprehensively capturing both search result quality and search efficiency.
